Automatic configuration of logical routers on edge nodes

ABSTRACT

Some embodiments provide a method or tool for automatically configuring a logical router on one or more edge nodes of an edge cluster (e.g., in a hosting system such as a datacenter). The method of some embodiments configures the logical router on the edge nodes based on a configuration policy that dictates the selection method of the edge nodes. In some embodiments, an edge cluster includes several edge nodes (e.g., gateway machines), through which one or more logical networks connect to external networks (e.g., external logical and/or physical networks). In some embodiments, the configured logical router connects a logical network to an external network through the edge nodes.

CLAIM OF BENEFIT TO PRIOR APPLICATIONS

The present Application is a continuation application of U.S. patent application Ser. No. 15/420,043, filed Jan. 30, 2017, now published as U.S. Patent Publication 2017/0317954. U.S. patent application Ser. No. 15/420,043 claims the benefit of Indian Patent Application 201641014866, filed Apr. 28, 2016. U.S. patent application Ser. No. 15/420,043, now published as U.S. Patent Publication 2017/0317954, is incorporated herein by reference.

BACKGROUND

Typical physical networks contain several physical routers to perform L3 forwarding (i.e., routing). When a first machine sends a packet to a second machine located on a different IP subnet, the first machine sends the packet to a router that uses a destination IP address of the packet to determine through which of its physical interfaces the packet should be sent out. In logical networks, user-defined data compute nodes (e.g., virtual machines) on different subnets also communicate with each other through logical switches and logical routers. A user (e.g., a datacenter network administrator) defines the logical elements (e.g., logical switches, logical routers) for a logical network topology. For a logical router that connects the logical network to one or more external networks, the user has to manually specify the edge nodes on which the logical router is configured.

BRIEF SUMMARY

Some embodiments provide a method or tool for automatically configuring a logical router on one or more edge nodes of an edge cluster (e.g., in a hosting system such as a datacenter). The method of some embodiments configures the edge nodes to implement the logical router based on a configuration policy that dictates the selection method of the edge nodes. In some embodiments, an edge cluster includes several edge nodes (e.g., gateway machines), through which one or more logical networks connect to external networks (e.g., external logical and/or physical networks). In some embodiments, the configured logical router connects a logical network to an external network through the edge nodes.

The logical router, in some embodiments, includes one distributed routing component (also referred to as a distributed router or DR) and one or more service routing components (each of which is also referred to as a service router or SR). The distributed routing component is implemented in a distributed manner by numerous machines within the network (e.g., a hosting system network), while each service routing component is implemented by a single edge node. The method of some embodiments configures a logical router for the logical network by (1) configuring the distributed component of the logical router on several different host machines as well as one or more edge nodes, and (2) configuring the service components of the logical router only on the edge nodes.

The management plane cluster (e.g., a manager computer in the cluster, a manager application, etc.) of a logical network receives the logical network topology from a user (e.g., a tenant of a hosting system, a network administrator of the hosting system, etc.) in some embodiments. The user provides the logical network definition (e.g., logical network topology) to the management plane through a set of application programming interface (API) calls. The management plane, based on the received logical network definition, generates the necessary configuration data for the logical forwarding elements (e.g., logical switches, logical routers, logical middleboxes, etc.). The management plane then pushes the configuration data to a control plane of the network (e.g., to one or more controller machines or applications of the control plane). Based on the generated configuration data, the management and control planes configure the logical forwarding elements on a set of physical nodes (e.g., host machines, gateway machines, etc.) that implements the logical network.

When the logical network topology is connected to an external network, the management plane (e.g., a manager machine) of some embodiments automatically determines which edge nodes in the edge cluster are the ideal candidates for implementing the logical router(s) of the logical network. That is, the management plane identifies the best edge node candidate on which a service component of the logical router can be installed (configured). In some embodiments, the management plane makes such a determination based on a configuration policy. That is, in some embodiments, the management plane receives a configuration policy from a user (e.g., the network administrator) and, based on the received configuration policy, identifies the best edge nodes on which the logical routers of the logical network can be implemented. The management plane then configures the identified edge nodes to implement the logical router automatically (i.e., without any user intervention and based only on the configuration policy).

A logical router is deployed in a logical network topology in active-active mode or active-standby mode in some embodiments. In the active-active mode, the management plane applies the same configuration rules for placement of each service router on an edge node in some embodiments. For active-standby mode, however, some embodiments may define two different sets of rules for the active and standby edge node selections. Some other embodiments may define the same set of rules for configuration of both active and standby service routers on the edge nodes. The configuration policy may also specify a static binding between active and standby edge node selections. For example, a user may define a rule in the logical router configuration policy which specifies that when an active SR is configured on a first edge node, a second particular edge node should host the corresponding standby SR.

Before the management plane selects one or more edge nodes of an edge cluster as the candidates to configure the logical router, the management plane of some embodiments identifies the edge nodes on which the logical router should not and/or could not be realized. In some such embodiments, after excluding a set of disqualified edge nodes, the management plane starts analyzing the remaining edge nodes for configuring the logical router. The management plane disqualifies the edge nodes based on a set of constraining rules. These constraining rules, in some embodiments, include user-defined constraints, physical constraints, and product constraints.

The configuration policy, in some embodiments, includes a set of rules that determines the selection of the edge nodes (e.g., gateway machines). In some embodiments, the set of rules is ordered based on a ranking that is assigned (e.g., in the configuration policy) to each rule. In some such embodiments, the management plane first tries to identify an edge node that matches the highest ranked rule. That is, the management plane identifies the gateway machine that satisfies the specification that is set forth in the rule. In some embodiments, when the management plane does not find any matching edge node for the highest ranked rule, the management plane tries to find an edge node that matches the next highest ranked rule in the set of rules. If none of the edge nodes satisfies any of the policy rules, the management plane of some embodiments selects the first available edge node on which the logical router (i.e., the service components of the logical router) can be configured.
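
To make the rank-ordered matching concrete, the following is a minimal Python sketch of how a manager might walk a ranked rule list and fall back to the first available edge node. The EdgeNode and Rule structures, their field names, and the select_edge_node helper are illustrative assumptions made for this example, not part of the embodiments described above.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EdgeNode:
    # Hypothetical per-node attributes a policy rule might inspect.
    name: str
    active_sr_count: int = 0
    connection_count: int = 0
    available: bool = True


@dataclass
class Rule:
    # A rule pairs a ranking with a predicate over edge nodes.
    rank: int
    description: str
    matches: Callable[[EdgeNode], bool]


def select_edge_node(nodes: List[EdgeNode], rules: List[Rule]) -> Optional[EdgeNode]:
    """Try rules from highest rank to lowest; fall back to the first available node."""
    candidates = [n for n in nodes if n.available]
    for rule in sorted(rules, key=lambda r: r.rank, reverse=True):
        for node in candidates:
            if rule.matches(node):
                return node
    # No rule matched any node: pick the first available edge node, if any.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    nodes = [EdgeNode("edge-1", active_sr_count=3), EdgeNode("edge-2", active_sr_count=0)]
    rules = [
        Rule(rank=10, description="prefer nodes with no SRs", matches=lambda n: n.active_sr_count == 0),
        Rule(rank=5, description="prefer lightly connected nodes", matches=lambda n: n.connection_count < 100),
    ]
    chosen = select_edge_node(nodes, rules)
    print(chosen.name if chosen else "no edge node available")  # -> edge-2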

In some embodiments, the configuration policy includes a rule that selects the next edge node in a sequential order (i.e., in a round-robin approach). The configuration policy, in some embodiments, also includes rules that select an edge node that has been least recently used; an edge node that has the lowest number of connections (to the external networks); an edge node through which the least network traffic passes (i.e., the lowest number of packets is sent, through the edge node, to the external networks); an edge node on which the lowest number of logical routers (irrespective of their capacity) is configured; an edge node that has the lowest amount of aggregated capacity of configured logical routers; an edge node on which other logical routers of the logical network (e.g., on a north-south path) have already been configured; and an edge node on which the least number of logical routers of other logical networks is configured (e.g., to ensure fault isolation between the tenants). In some other embodiments, the configuration policy includes other rules based on which the user chooses to configure the logical routers on the edge cluster.
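
Several of the criteria listed above can be viewed as orderings over the edge nodes. Below is a hedged Python sketch showing how a few of them (least recently used, fewest connections, lowest traffic, fewest configured routers, lowest aggregated capacity) might be expressed as key functions over a hypothetical node record; the field names are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EdgeNodeStats:
    # Hypothetical statistics the management plane might track per edge node.
    name: str
    last_used_at: float          # timestamp of the last SR placement
    external_connections: int    # connections to external networks
    traffic_packets: int         # packets forwarded to external networks
    configured_routers: int      # number of SRs configured on the node
    aggregated_capacity: int     # sum of capacities of configured routers


# Each named criterion maps to a key function; the node minimizing the key wins.
CRITERIA: Dict[str, Callable[[EdgeNodeStats], float]] = {
    "least_recently_used": lambda n: n.last_used_at,
    "lowest_connections": lambda n: n.external_connections,
    "lowest_traffic": lambda n: n.traffic_packets,
    "fewest_routers": lambda n: n.configured_routers,
    "lowest_aggregated_capacity": lambda n: n.aggregated_capacity,
}


def pick_by_criterion(nodes: List[EdgeNodeStats], criterion: str) -> EdgeNodeStats:
    """Return the edge node that minimizes the chosen criterion."""
    return min(nodes, key=CRITERIA[criterion])


if __name__ == "__main__":
    nodes = [
        EdgeNodeStats("edge-1", 100.0, 12, 9000, 4, 40),
        EdgeNodeStats("edge-2", 250.0, 3, 1500, 1, 10),
    ]
    print(pick_by_criterion(nodes, "lowest_connections").name)  # -> edge-2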

The management plane of some embodiments deploys one or more databases (in one or more data storages) that keep the necessary statistical data based on which the management plane decides which edge node is the best candidate for configuring the defined logical routers. For instance, in a round-robin approach, the management plane starts with the first available edge node and selects each next edge node based on the data the management plane retrieves from a data storage that keeps track of edge nodes that have not been selected yet. As another example, the database, in some embodiments, keeps track of the number of connections that each edge node has established to the external networks. This way, when the policy specifies that the best candidate is an edge node that has the least number of connections, the management plane identifies the best candidate by querying the database for an edge node with the least number of connections.
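
As an illustration of the database-backed lookup, the sketch below uses Python's built-in sqlite3 module to store per-edge-node connection counts and answer a "least connections" query; the table name and columns are invented for this example and are not specified by the embodiments.

import sqlite3

# In-memory database standing in for the management plane's statistics storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge_node_stats ("
    " node_name TEXT PRIMARY KEY,"
    " external_connections INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO edge_node_stats VALUES (?, ?)",
    [("edge-1", 12), ("edge-2", 3), ("edge-3", 7)],
)

# "Best candidate" under a least-connections policy: the node with the
# smallest number of connections to the external networks.
row = conn.execute(
    "SELECT node_name FROM edge_node_stats"
    " ORDER BY external_connections ASC LIMIT 1"
).fetchone()
print(row[0])  # -> edge-2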

In some embodiments, the management plane queries the edge nodes of the edge cluster in order to receive these statistical data and stores the data in the database. In some embodiments, each time the management plane configures an edge node to implement a logical router (i.e., a service router of the logical router) or removes a logical router from an edge node, the management plane updates the statistical data in the database. Yet, in some other embodiments, the management plane employs both of these methods. That is, the management plane updates the database with each new transaction (e.g., addition or deletion of an SR), and at the same time, the management plane queries the edge nodes upon occurrence of an event (e.g., within certain time intervals) to receive more precise information to store in the database.
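
A hedged sketch of the hybrid approach described above follows: the statistics store is updated on every SR add/remove transaction and periodically reconciled with counts reported by the edge nodes. The poll_edge_node callback and the field names are placeholders for whatever query mechanism a given deployment would use.

from typing import Callable, Dict, List


class EdgeNodeStatsStore:
    """Toy statistics store: per-node count of configured service routers."""

    def __init__(self) -> None:
        self.sr_count: Dict[str, int] = {}

    def record_sr_added(self, node: str) -> None:
        # Transactional update: an SR was just configured on this node.
        self.sr_count[node] = self.sr_count.get(node, 0) + 1

    def record_sr_removed(self, node: str) -> None:
        # Transactional update: an SR was just removed from this node.
        self.sr_count[node] = max(0, self.sr_count.get(node, 0) - 1)

    def reconcile(self, nodes: List[str], poll_edge_node: Callable[[str], int]) -> None:
        # Event-driven refresh (e.g., on a timer): overwrite the cached counts
        # with the values reported by the edge nodes themselves.
        for node in nodes:
            self.sr_count[node] = poll_edge_node(node)


if __name__ == "__main__":
    store = EdgeNodeStatsStore()
    store.record_sr_added("edge-1")
    store.record_sr_added("edge-1")
    store.record_sr_removed("edge-1")
    # Pretend the edge node reports 2 SRs when polled, correcting any drift.
    store.reconcile(["edge-1"], poll_edge_node=lambda name: 2)
    print(store.sr_count)  # -> {'edge-1': 2}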

Some embodiments load balance the logical routers that are configured on the different edge nodes of an edge cluster. That is, in some embodiments, the management plane, based on occurrence of an event (e.g., user request, lapse of a certain time period, node failure, etc.), identifies the logical routers that are implemented by each edge node, and based on the configuration policy, reassigns the different logical routers to different edge nodes. For instance, when an edge node fails or shuts down, the management plane of some embodiments automatically reassigns the implementation of the logical routers (among other logical entities) between the edge nodes that are still active in the edge cluster. In some embodiments, one or more changes in the configuration policy (e.g., addition of a new rule, deletion or modification of a current rule, etc.) could be the triggering event for reconfiguring the logical routers on the edge nodes of the edge cluster.
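
The following is a minimal sketch of the rebalancing idea: when a triggering event (here, an edge node failure) occurs, the SRs hosted on the failed node are reassigned to the surviving nodes according to a simple fewest-routers heuristic. The data structures and the heuristic stand in for consulting the real configuration policy and are assumptions made for this illustration.

from typing import Dict, List


def rebalance_on_failure(
    placements: Dict[str, List[str]], failed_node: str
) -> Dict[str, List[str]]:
    """Reassign the SRs from a failed edge node to the remaining active nodes.

    `placements` maps edge node name -> list of service router names hosted there.
    Each displaced SR goes to the surviving node currently hosting the fewest SRs.
    """
    displaced = placements.pop(failed_node, [])
    if not placements:
        raise RuntimeError("no active edge nodes left to host the displaced SRs")
    for sr in displaced:
        target = min(placements, key=lambda node: len(placements[node]))
        placements[target].append(sr)
    return placements


if __name__ == "__main__":
    placements = {
        "edge-1": ["tenantA-SR1", "tenantB-SR1"],
        "edge-2": ["tenantA-SR2"],
        "edge-3": [],
    }
    print(rebalance_on_failure(placements, "edge-1"))
    # -> {'edge-2': ['tenantA-SR2', 'tenantB-SR1'], 'edge-3': ['tenantA-SR1']}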

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all of the inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description, and the Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description, and the Drawings, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 conceptually illustrates a physical network topology that connects one or more logical networks, which are implemented on the physical network infrastructure, to one or more external networks.

FIG. 2 illustrates a configuration view of a logical router, which represents a logical network as designed by a user.

FIG. 3 illustrates a management plane view of a logical network when the logical router is implemented in a distributed manner.

FIG. 4 illustrates a physical distributed implementation of a logical router defined for a logical network.

FIG. 5 conceptually illustrates a process of some embodiments that configures a logical router on one or more edge nodes of a physical infrastructure that implements one or more logical networks.

FIGS. 6A-6B illustrate a manager of a logical network that receives the logical router configuration policy and the logical network's definition from one or more users and configures the logical routers of the logical network on one or more edge nodes based on the configuration policy.

FIG. 7 conceptually illustrates a process of some embodiments that automatically selects an edge node of an edge cluster based on a configuration policy and configures a logical router (i.e., a service routing component of the logical router) on the selected edge node.

FIG. 8 conceptually illustrates a process of some embodiments that determines whether an edge node matches a selected rule, and configures a logical router on the edge node when the node satisfies the specified requirement in the rule.

FIG. 9 conceptually illustrates a multi-tier logical network with logical routers that are situated in different tiers of the logical network.

FIG. 10 illustrates the management plane view for the logical topology of FIG. 9 when a Tenant Logical Router (TLR) in the logical network is completely distributed.

FIG. 11 illustrates the management plane view for the logical topology of FIG. 9 when the TLR in the logical network includes at least one service routing component (e.g., because stateful services that cannot be distributed are defined for the TLR).

FIG. 12 illustrates a logical network topology that is connected to an external network through a set of logical routers that are positioned in the same or different layers of the logical topology.

FIGS. 13A-13B illustrate the physical implementation of the logical network elements shown in FIG. 12 on different edge nodes of an edge cluster.

FIG. 14 conceptually illustrates a process of some embodiments that load balances the implementations of different logical routers of one or more logical networks among the different edge nodes of an edge cluster.

FIG. 15 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it should be understood that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a method and tool for automatically configuring a logical router on one or more edge nodes of an edge cluster (e.g., in a hosting system such as a datacenter). The method of some embodiments configures the logical router on the edge nodes based on a configuration policy that dictates the selection method of the edge nodes. In some embodiments, an edge cluster includes several edge nodes (e.g., gateway machines), through which one or more logical networks connect to external networks (e.g., external logical networks and/or external physical networks).

In some embodiments, the configured logical router connects a logical network to an external network through the edge nodes. The logical router, in some embodiments, includes one distributed routing component (also referred to as a distributed router or DR) and one or more service routing components (each of which is also referred to as a service router or SR). The distributed routing component (DR) is implemented in a distributed manner by numerous machines within the network (e.g., a hosting system network), while each service component (SR) is implemented by a single edge node.

The method of some embodiments configures a logical router for the logical network by (1) configuring the DR of the logical router on several different host machines as well as one or more edge nodes, and (2) configuring the SR(s) of the logical router only on the edge nodes. A logical router that is automatically configured on an edge node includes an SR that is directly connected to an external network (i.e., the service router belongs to a logical router that is at the edge of the logical network topology), or an SR that is required to be implemented by the edge node (e.g., the service router belongs to a logical router that is in a middle layer of the logical network topology and provides stateful services).

As will be described in more detail below, when a middle layer logical router (e.g., a second tier logical router in a multi-tier network topology) is a fully distributed logical router and does not provide stateful services (e.g., network address translation (NAT), stateful firewall, load balancing, etc.), the logical router is not required to be implemented by an edge node (since the logical router does not have an SR, or for any other reason does not provide stateful services). Some embodiments do not configure such a logical router on any edge node of the edge cluster.

In some embodiments, the management plane cluster (e.g., a manager machine in the cluster, a manager application, etc.) of a logical network receives the logical network topology from a user (e.g., a tenant of a hosting system, a network administrator of the hosting system, etc.). The user provides the logical network definition (e.g., logical network topology) to the management plane through a set of application programming interface (API) calls in some embodiments. The management plane, based on the received logical network definition, generates the necessary configuration data for the logical forwarding elements (e.g., logical switches, logical routers, logical middleboxes, etc.). Based on the generated data, the management plane configures the logical forwarding elements on a set of physical nodes (e.g., host machines, gateway machines, etc.) that implements the logical network.

The management plane of some embodiments also pushes the generated configuration data to a control plane (e.g., one or more controllers in a central control plane (CCP) cluster) of the logical network. The control plane, in some embodiments, modifies the configuration of the logical forwarding elements (LFEs) on the physical nodes that implement the LFEs at runtime. That is, based on the generated configuration data that the control plane receives from the management plane and the runtime data that the control plane receives from the physical nodes, the control plane modifies the configuration of the LFEs on the physical nodes. In some embodiments, as will be described in more detail below, the management and control planes configure the LFEs on a physical node by configuring a managed forwarding element (MFE) that executes on the physical node (e.g., in the virtualization software of the physical node) to implement the LFEs of the logical network.

A logical network topology, in some embodiments, includes a set of logical network entities that are placed on different logical paths of the network. Examples of logical network entities in a logical network include logical forwarding elements (e.g., logical L2 and L3 switches, logical routers), logical middleboxes (e.g., logical firewalls, logical load balancers, etc.), and other logical network elements such as a source or destination data compute node (DCN) and a tunnel endpoint (e.g., implemented by an MFE). While a DCN or tunnel endpoint typically operates on a single host machine, a logical forwarding element or logical middlebox spans several different MFEs (e.g., software and/or hardware MFEs) that operate on different machines (e.g., a host machine, a top of rack hardware switch, etc.).

The logical forwarding elements of a logical network logically connect several different DCNs (e.g., virtual machines (VMs), containers, physical machines, etc.) that run on different host machines, to each other and to other logical and/or physical networks. The logical forwarding elements that logically connect the DCNs, in some embodiments, are part of a logical network topology for a user (e.g., a tenant) of a hosting system (e.g., a datacenter). In some embodiments, different subsets of DCNs reside on different host machines that execute software managed forwarding elements (MFEs). Each MFE, as stated above, executes on a physical node (e.g., a host machine) and implements the LFEs of the logical network to which a subset of DCNs that runs on the host machine is logically connected.

A software MFE, in some embodiments, is a software application and/or process that executes in the virtualization software (e.g., a hypervisor) of the physical node. Implementing the LFEs on a host machine, in some embodiments, includes performing network traffic forwarding processing for the packets that are originated from and/or destined for a set of DCNs that resides on the host machine on which the MFE operates. The LFEs are also implemented by one or more hardware MFEs (e.g., Top of Rack (TOR) switches) in some embodiments, in order to logically connect the physical machines (e.g., servers, host machines, etc.) that are connected to the hardware MFEs to other DCNs of the logical network. Additionally, as a particular physical host machine may host DCNs of more than one logical network (e.g., belonging to different tenants), the software MFE running on the host machine (or a hardware MFE) may implement different sets of LFEs that belong to different logical networks.

In some embodiments, as described above, the management plane (e.g., a manager machine or application) generates the logical network entities' data (i.e., the desired state) for a logical network topology. The management plane configures the logical network entities on different physical nodes based on the desired state. That is, a manager machine or application configures the virtualization software that runs on the physical nodes to implement the logical network entities. The management plane also pushes the desired state to one or more controllers in the CCP cluster. The MFEs (e.g., MFEs operating in the host machines and gateway machines) also push runtime data related to the LFEs that the MFEs implement (i.e., the discovered state of the LFEs) to the CCP cluster.

The CCP cluster processes the logical entity definition data (i.e., the desired state) received from the management plane along with the runtime data received from the MFEs (i.e., the discovered state) in order to generate configuration and forwarding data for the logical entities that are implemented on the MFEs at runtime. The configuration and forwarding data that is distributed to the physical nodes defines common forwarding behaviors of the MFEs that operate on the physical nodes in order to implement the LFEs. In some embodiments, a local controller that operates on each physical node (e.g., in the hypervisor of a host machine) receives the configuration and forwarding data from the CCP cluster first.

The local controller then generates customized configuration and forwarding data that defines specific forwarding behaviors of an MFE that operates on the same host machine on which the local controller operates, and distributes the customized data to the MFE. The MFE implements the set of logical forwarding elements based on the configuration and forwarding data received from the local controller. Each MFE can be connected to several different DCNs, different subsets of which may belong to different logical networks for different tenants. As such, the MFE is capable of implementing different sets of logical forwarding elements for different logical networks. The MFE implements an LFE by mapping the LFE ports to the physical ports of the MFE in some embodiments.

FIG. 1 conceptually illustrates a physical network topology 100 that connects one or more logical networks implemented on the physical nodes of the network to one or more external networks. More specifically, this figure shows different physical nodes such as host machines, gateway machines, managers, and controllers of a physical network (e.g., the physical network of a datacenter) that implement logical network entities of different logical networks and that connect the logical networks to external networks. FIG. 1 includes management and control clusters 105, an edge cluster 110, an external network 170, and two host machines 135 and 140. Each of the host machines shown in the figure includes a managed forwarding element 145 (MFE1 and MFE2), a local controller 160 (LC1 and LC2), and a set of data compute nodes 150 (VM1-VM4).

In some embodiments, the MFEs 145 are implemented in the virtualization software (e.g., hypervisor) of the host machines 135 and 140 (the hypervisors are not shown in the figure for simplicity of description). The management cluster includes a set of managers 115, while the controller cluster (CCP cluster) includes a set of controllers 120. The edge cluster 110 includes a set of edge nodes (e.g., gateway machines) 125 that handle north-south traffic of the logical networks (e.g., connect the logical networks implemented on the physical network to the external network 170).

For example, a logical network, which logically connects the VMs executing on the host machine 135 to the VMs that execute on the host machine 140, can be connected to the external network 170 through one or more gateway machines 125 of the edge cluster 110. The logical network (which includes, e.g., a set of logical switches and logical routers) is configured and managed by the management and control clusters 105. The logical switches and routers of the logical network are implemented by the MFEs 145 that run on the host machines and a set of MFEs (not shown in this figure) that runs on the edge nodes of the edge cluster 110.

Each of the managers 115 and controllers 120 can be a physical computing device (e.g., a server, a computer, etc.), a data compute node (DCN) such as a virtual machine (VM), a container, etc., or a software instance (or a process) operating on a physical computing device or DCN. In some embodiments, a manager includes different user interface applications for administration, configuration, monitoring, and troubleshooting of one or more logical networks in the physical network infrastructure (e.g., a hosting system network). A subset of one or more controllers of some embodiments controls the data communications between the different MFEs that implement the logical network elements of the logical network.

As described above, the CCP cluster (e.g., one or more controllers 120) controls the network data communication between the different DCNs of a logical network (e.g., between the VMs 150 in the illustrated example) by controlling the data communication between the MFEs 145. The CCP cluster communicates with the MFEs 145 in order to control the data exchange between the MFEs since the MFEs also implement virtual tunnel endpoints (VTEPs) that ultimately exchange the logical network data between the DCNs. In order to control the data exchange, the CCP cluster of some embodiments receives runtime data for the logical network entities (e.g., VMs 150, LFEs of the logical network, etc.) from each of the MFEs. The CCP cluster 120 also receives the logical topology data (i.e., the desired state of the logical network) from the management cluster (e.g., a manager 115) and uses the desired state data along with the runtime data in order to control the data communications of the logical network.

Typical logical network definition data, in some embodiments, includes data that defines the location of DCNs (e.g., the distribution of VMs on host machines), data that defines connection topology between the DCNs and locations of the LFEs in the topology, data that defines middlebox services, which are applied to the LFEs (e.g., distributed firewall policies), etc. Typical runtime data, in some embodiments, includes layer 2 control plane tables such as virtual tunnel endpoint (VTEP) tables, media access control (MAC) tables, and address resolution protocol (ARP) tables; layer 3 routing tables such as routing information base (RIB) tables and forwarding information base (FIB) tables; statistics data collected from MFEs; etc.

In some embodiments, the local controller 160 of each hypervisor of the host machines receives logical network data from a controller 120 of the CCP cluster. The local controller 160 then converts and customizes the received logical network data for the local MFE 145 that operates on the same machine on which the local controller operates. The local controller then delivers the converted and customized data to the local MFEs 145 on each host machine. In some embodiments, the connections of the end machines to an LFE (e.g., a logical switch) are defined using logical ports of the LFE, which are mapped to the physical ports of the MFEs (e.g., a first logical port of the LFE is mapped to a physical port of MFE1 that is coupled to VM1, and a second logical port of the LFE is mapped to another physical port of MFE2 that is connected to VM3).
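
To illustrate the port-mapping idea at the end of the previous paragraph, here is a small Python sketch of a table a local controller might hand to its MFE, keyed by (logical switch, logical port). The identifiers and the dictionary layout are invented for this example and are not a description of any particular product's data model.

from typing import Dict, Tuple

# Hypothetical mapping from (logical switch, logical port) to the physical
# port of the MFE that is attached to the corresponding VM.
PortKey = Tuple[str, str]

port_map: Dict[PortKey, Dict[str, str]] = {
    ("LS-1", "lport-1"): {"mfe": "MFE1", "physical_port": "vnic-0", "attached_vm": "VM1"},
    ("LS-1", "lport-2"): {"mfe": "MFE2", "physical_port": "vnic-3", "attached_vm": "VM3"},
}


def resolve(logical_switch: str, logical_port: str) -> Dict[str, str]:
    """Return the physical binding for a logical port, as an MFE would look it up."""
    return port_map[(logical_switch, logical_port)]


if __name__ == "__main__":
    print(resolve("LS-1", "lport-2"))  # -> binding on MFE2 toward VM3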

As described above, in some embodiments, the LFEs (logical routers and switches) of a logical network are implemented by each MFE that executes on a host machine. For example, when an MFE (e.g., MFE1) receives a packet from a DCN (e.g., VM1) that couples to a first port of a logical switch, the MFE performs the network forwarding processing for the logical switch to which the DCN logically couples. The MFE also performs the forwarding processing for any additional LFE (e.g., logical router processing if the packet is sent to an external network (e.g., external network 170), logical router processing and processing for another logical switch in the network if the packet is sent to an end machine (DCN) coupled to the other logical switch, etc.).

Based on the forwarding processing, the MFE can decide where to send the received packet. For example, if the packet should be sent to VM3, which is coupled to a second port of the LFE that is implemented by MFE2, MFE1 sends the packet to MFE2 (through an established tunnel between MFE1 and MFE2) to be delivered to VM3. In the illustrated figure, the dashed lines that connect the management and control plane to the edge cluster and host machines represent the management and control plane data exchange, while the solid lines represent the data plane exchange between the host machines and the edge cluster.

When the logical network topology is connected to an external network, the management plane (e.g., a manager 115) automatically determines which edge nodes in the edge cluster 110 are the ideal candidates for implementing the logical router(s) of the logical network. That is, the management plane identifies the best edge node candidate on which a service component of the logical router can be installed (configured). In some embodiments, the management plane makes such a determination based on a configuration policy. That is, in some embodiments, the management plane receives the configuration policy from a user (e.g., the network administrator) and, based on the received configuration policy, identifies the best edge nodes on which the logical routers of the logical network can be implemented. The management plane then configures the logical routers on the identified edge nodes automatically (i.e., solely based on the configuration policy and without any user intervention).

Before the management plane selects one or more edge nodes of an edge cluster as the candidates to configure the logical router, the management plane of some embodiments identifies the edge nodes on which the logical router should not and/or could not be realized. In some such embodiments, after excluding a set of disqualified edge nodes, the management plane starts analyzing the remaining edge nodes for configuring the logical router. The management plane disqualifies the edge nodes based on a set of constraining rules. These constraining rules, in some embodiments, include, but are not limited to, user-defined constraints, physical constraints, and product constraints.

A user constraint, as its name suggests, is an edge node selection restriction that is specified by a user. That is, the user defines one or more rules to restrict the placement of logical routers on the edge nodes. For example, the user may specify that the logical routers of a first tenant may not coexist with the logical routers of a second tenant on the edge nodes. As such, before the management plane configures the logical routers of either of these two tenants, the management plane excludes any edge node of the edge cluster on which a logical router of the other tenant has already been configured.

A physical constraint includes one or more limitations in the system resources of an edge node because of which the edge node is not able to realize the logical router. For example, a particular logical router may require a particular processor (CPU) architecture or a certain amount of memory that the edge node does not provide. As another example, a logical router may need a particular type of network interface controller that does not exist on a particular edge node. A product constraint, on the other hand, arises from a product's internal requirements. For example, certain services (e.g., stateful firewall) in a particular release of a product may need exclusive access to an edge node. Such a restriction makes the edge node unavailable to a service router of a logical router. Additionally, some services in a logical router itself may require exclusive access to an edge node, in which case the edge node cannot be shared by other logical routers. The product constraints may be different in different releases of a product.
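
A hedged sketch of the disqualification step described across the last three paragraphs: user, physical, and product constraints are modeled as predicates that remove edge nodes from consideration before any policy rule is evaluated. All names, fields, and thresholds here are assumptions for illustration only.

from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class EdgeNode:
    name: str
    memory_gb: int
    tenants_hosted: Set[str] = field(default_factory=set)
    reserved_for_exclusive_service: bool = False


@dataclass
class SRRequest:
    tenant: str
    required_memory_gb: int


# Each constraint returns True when the node must be DISQUALIFIED for this request.
Constraint = Callable[[EdgeNode, SRRequest], bool]

user_constraint: Constraint = (
    # User-defined: e.g., tenant-A routers may not share a node with tenant-B routers.
    lambda node, req: req.tenant == "tenant-A" and "tenant-B" in node.tenants_hosted
)
physical_constraint: Constraint = (
    # Physical: the node lacks the memory the logical router requires.
    lambda node, req: node.memory_gb < req.required_memory_gb
)
product_constraint: Constraint = (
    # Product: the node is already reserved for a service needing exclusive access.
    lambda node, req: node.reserved_for_exclusive_service
)


def qualified_nodes(nodes: List[EdgeNode], req: SRRequest) -> List[EdgeNode]:
    """Filter out every node that violates at least one constraining rule."""
    constraints = [user_constraint, physical_constraint, product_constraint]
    return [n for n in nodes if not any(c(n, req) for c in constraints)]


if __name__ == "__main__":
    nodes = [
        EdgeNode("edge-1", memory_gb=8, tenants_hosted={"tenant-B"}),
        EdgeNode("edge-2", memory_gb=32),
        EdgeNode("edge-3", memory_gb=64, reserved_for_exclusive_service=True),
    ]
    req = SRRequest(tenant="tenant-A", required_memory_gb=16)
    print([n.name for n in qualified_nodes(nodes, req)])  # -> ['edge-2']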

One of ordinary skill in the art would realize that the number of host machines, managers, controllers, edge nodes, and virtual machines illustrated in the figure is exemplary and that a logical network for a tenant of a hosting system may span a multitude of host machines (and third-party hardware switches) and logically connect a large number of DCNs to each other (and to several other physical devices that are connected to the hardware switches). Additionally, while shown as VMs in this figure and other figures below, it should be understood that other types of data compute nodes (e.g., namespaces, containers, etc.) may connect to logical forwarding elements in some embodiments.

As described before, the management plane of some embodiments receives a definition of a logical network and generates configuration data that defines the different logical forwarding entities of the logical network. One such logical forwarding entity of a logical network is a logical router. In some embodiments, the management plane receives a definition of a logical router (e.g., through one or more API calls) and defines a distributed logical router that includes several routing components. Each of these routing components is separately assigned a set of routes and a set of logical interfaces (ports).

Each logical interface of each routing component is also assigned a network layer (e.g., Internet Protocol or IP) address and a data link layer (e.g., media access control or MAC) address. In some embodiments, the several routing components defined for a logical router include a single distributed router (also referred to as the distributed routing component) and several different service routers (also referred to as service routing components). In addition, the management plane of some embodiments defines a transit logical switch (TLS) for handling communications between the components internal to the logical router (i.e., between the distributed router and the service routers).
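
The sketch below shows, under assumed data structures, how a management plane might expand a single user-facing logical router definition into one DR, one SR per uplink, and a transit logical switch with a port per component. The address-assignment scheme and the helper names are purely illustrative and are not the addressing or internal objects of any particular product.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Interface:
    name: str
    ip: str
    mac: str


@dataclass
class RoutingComponent:
    name: str
    kind: str                      # "DR" or "SR"
    interfaces: List[Interface] = field(default_factory=list)


@dataclass
class TransitLogicalSwitch:
    name: str
    ports: List[str] = field(default_factory=list)


def expand_logical_router(lr_name: str, uplink_count: int):
    """Expand one logical router definition into a DR, SRs, and a TLS."""
    tls = TransitLogicalSwitch(name=f"{lr_name}-tls")
    components: List[RoutingComponent] = []

    # One distributed routing component, with a northbound interface on the TLS.
    dr = RoutingComponent(f"{lr_name}-dr", "DR",
                          [Interface("to-tls", "169.254.0.1/28", "02:00:00:00:00:01")])
    components.append(dr)
    tls.ports.append(dr.name)

    # One service routing component per uplink, each with an uplink interface
    # and a TLS interface (illustrative addresses only).
    for i in range(uplink_count):
        sr = RoutingComponent(
            f"{lr_name}-sr{i}", "SR",
            [Interface("uplink", f"192.0.2.{10 + i}/24", f"02:00:00:00:01:{i:02x}"),
             Interface("to-tls", f"169.254.0.{2 + i}/28", f"02:00:00:00:02:{i:02x}")])
        components.append(sr)
        tls.ports.append(sr.name)

    return components, tls


if __name__ == "__main__":
    comps, tls = expand_logical_router("lr-215", uplink_count=2)
    print([c.name for c in comps], tls.ports)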

Some embodiments implement the distributed routing component of the logical router in a distributed manner across the different MFEs, in the same manner that a logical L2 switch spans the different MFEs. The MFEs on which the distributed router (DR) is implemented include (1) software MFEs that operate on the hypervisors of the host machines and edge nodes, and (2) other hardware VTEPs (e.g., third-party TOR switches). Some embodiments implement each of the service routing components of the logical network on only an edge node (e.g., a gateway), which is a machine at the edge of the network (e.g., the datacenter network) in some embodiments, in order to communicate with one or more external networks. Each service router (SR) has an uplink interface for communicating with an external network as well as a TLS interface for connecting to the transit logical switch and communicating the network data with the distributed routing component of the logical router that is also connected to the TLS.

The SRs of a logical router, in some embodiments, may be configured in active-active or active-standby mode. In active-active mode, all of the service components are fully functional at the same time, and traffic can ingress or egress from the logical network through the service components using equal-cost multi-path (ECMP) forwarding principles (balancing the traffic across the various service routing components). In this mode, each logical interface of each separate service component has unique IP and MAC addresses for communicating with an external network and/or with the distributed component (through the transit logical switch).

In the active-active mode, the management plane of some embodiments applies the same configuration rules for placement of each service router on an edge node. That is, the manager analyzes the same set of rules in the configuration policy to configure the first active SR of the logical router on a first edge node and the second active SR on a second edge node. For active-standby mode, however, some embodiments may define two different sets of rules for the active and standby edge node selections.

Some other embodiments may define the same set of rules for configuration of both active and standby service routers on the edge nodes. The configuration policy may also specify a static binding between active and standby edge node selections. For example, a user may define a rule in the logical router configuration policy which specifies that when an active SR is configured on a first edge node, a second particular edge node should host the corresponding standby SR.

In some embodiments, the logical router is part of a multi-tier logical network structure. For example, a two-tier logical router structure of some embodiments includes (1) a single logical router (referred to as a provider logical router or PLR) for connecting the logical network (along with other logical networks) to one or more networks external to the hosting system, and (2) multiple logical routers (each referred to as a tenant logical router or TLR) that connect to the PLR and do not separately communicate with the external network. In some embodiments, a PLR is defined and administered by a user at the hosting system (e.g., a datacenter network administrator), while each TLR is defined and administered by a tenant of the hosting system (or both by the tenant and a user from the datacenter). In some embodiments, the management plane defines a transit logical switch between the distributed component of the PLR and the service components of the TLR. The concepts of TLR and PLR are described in more detail below by reference to FIGS. 9-11.

Some embodiments provide other types of logical router implementations in a physical network (e.g., a datacenter network), such as a centralized logical router. In a centralized logical router, L3 logical routing functionalities are performed only in gateway machines, and the management plane of some embodiments does not define any distributed routing component and instead only defines multiple service routing components, each of which is implemented in a separate gateway machine. Different types of logical routers (e.g., distributed logical routers, multi-layer logical routers, centralized logical routers, etc.) and the implementation of the different types of logical routers on edge nodes and managed forwarding elements operating on host machines of a datacenter are described in greater detail in U.S. patent application Ser. No. 14/814,473, filed Jul. 30, 2015.

Logical routers, in some embodiments, can be viewed from three different perspectives. The first of these views is the API view, or configuration view, which is how the user (e.g., a datacenter provider or a tenant) views and defines the logical router in a logical network topology. The second view is the management plane view, which is how the management cluster (e.g., a manager machine or application in the management cluster) internally defines (i.e., generates the configuration data of) the logical router. Finally, the third view is the physical realization, or implementation, of the logical router, which is how the logical router is actually implemented on different physical nodes in the physical network infrastructure (e.g., a datacenter network infrastructure).

FIGS. 2-4 illustrate the above-described different views of a logical router in a logical network that logically connects different end machines to each other and to other networks through different managed forwarding elements. More specifically, FIG. 2 illustrates a configuration view of a logical router, which represents a logical network as designed (or defined) by a user. The figure shows a logical network 200 that includes a logical router 215 and two logical switches 205 and 210. The logical router 215 has two logical ports that are connected to the logical switches 205 and 210. Logical switch 205 has logical ports that are coupled to virtual machines VM1 and VM2, while the logical switch 210 has logical ports that are coupled to virtual machines VM3 and VM4. The logical router 215 also includes two logical ports that are connected to the external physical network 220.

In some embodiments, a user (e.g., a datacenter network administrator, a tenant, etc.) defines each of the logical network entities 205-215 through a set of API calls. For example, the user executes an API call to create the logical switch 205 and two more API calls to create the two logical ports of the logical switch that are coupled to the virtual machines VM1 and VM2. Similarly, the user executes a set of API calls to generate the other logical forwarding elements that are shown in the figure. These API calls are received by a manager of the network, which in turn generates the configuration data for each logical network element and publishes the generated configuration data to the CCP cluster as well as other physical nodes of the physical network that implement the logical network entities.
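
For a concrete flavor of the API-driven workflow, here is a hedged sketch that issues REST-style calls for part of the topology of FIG. 2 using Python's requests library. The manager address, endpoint paths, payload fields, and the bearer-token placeholder are hypothetical and do not correspond to any specific product's API.

import requests

MANAGER = "https://manager.example.com/api/v1"  # hypothetical manager endpoint
session = requests.Session()
session.headers["Authorization"] = "Bearer <token>"  # placeholder credential


def create(resource: str, payload: dict) -> dict:
    """POST one logical entity definition to the (hypothetical) manager API."""
    resp = session.post(f"{MANAGER}/{resource}", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Logical switch 205 plus its two ports toward VM1 and VM2.
    ls = create("logical-switches", {"display_name": "LS-205"})
    create("logical-ports", {"switch_id": ls["id"], "attachment": "VM1"})
    create("logical-ports", {"switch_id": ls["id"], "attachment": "VM2"})
    # Logical router 215, with downlinks to the switches and uplinks to the
    # external network, would be created with further calls of the same shape.
    create("logical-routers", {"display_name": "LR-215", "uplinks": 2})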

FIG. 3 illustrates a management plane view 300 of the logical network 200 that is shown in FIG. 2. The management plane view 300 for the distributed implementation illustrates that the management plane, after receiving the definition of the logical router (e.g., the API calls from the user), creates a distributed routing component 330, two service routing components 340 and 350, and a transit logical switch 360 based on the received logical router definition.

In some embodiments, the management plane generates separate routing information bases (RIBs) and/or forwarding information bases (FIBs) for each of the routing components 330-350. That is, in addition to having separate objects created in the management plane, each of the routing components 330-350 is treated as a separate router with separate routing tables. The transit logical switch 360 has different logical ports that couple to each of the routing components 330-350, and each of these routing components has an interface to logically connect to the transit logical switch 360.

In some embodiments, the DR 330 is always located on the southbound side (i.e., facing the data compute nodes of the logical network, rather than facing the external physical network) of the logical router implementation. Unless the logical router has no service component, some embodiments do not configure the uplinks of the logical router for the distributed component, whose northbound interfaces instead couple to the transit logical switch that is part of the logical router 215.

In some embodiments, SRs 340 and 350 may deliver services (i.e., functionalities beyond simply routing, such as NAT, firewall, load balancing, etc.) in addition to providing the connection between the logical network and external physical networks. In some embodiments, the implementation of the SRs is designed to meet several goals. First, the implementation ensures that the services can scale out; that is, the services assigned to a logical router may be delivered by any of the several SRs of the logical router. Second, some embodiments configure the SRs in such a way that the service policies may depend on routing decisions (e.g., interface-based NAT). Finally, the SRs of a logical router have the ability to handle failure (e.g., of the physical machine on which an SR operates, of the tunnels to that physical machine, etc.) among themselves without requiring the involvement of a centralized control plane or management plane (though some embodiments allow the SRs to operate at reduced capacity or in a suboptimal manner).

FIG. 4 illustrates the physical distributed implementation of the logical router 215 of FIG. 2. More specifically, this figure shows how the physical nodes of the physical network architecture 400 implement the logical forwarding elements of the logical network architecture 200 shown in FIG. 2. The physical nodes shown in this figure include two gateway machines 410 and 420, and two host machines 430 and 440. The figure also shows that an external network 405 is connected to the two gateway machines 410 and 420. Each of the illustrated physical nodes includes an MFE 450 that operates in the virtualization software of the physical node in some embodiments. The host machine 430 hosts the virtual machines VM1 and VM3 along with a set of other data compute nodes, while the host machine 440 hosts the virtual machines VM2 and VM4 along with a set of other data compute nodes.

Each MFE 450 implements the LFEs of the logical network by performing the forwarding processing of the LFEs for the packets that are received from or sent to the corresponding VMs that are connected to the MFE, and/or the external network 405. For example, the first port of the logical switch 205 shown in FIG. 2 is mapped to a physical port of MFE1 that is coupled to VM1, while the second port of the logical switch 205 is mapped to a physical port of MFE2 that is coupled to VM2. As stated above, the two gateway machines 410 and 420 are connected to the external network 405 as well as the host machines 430 and 440. Each host machine hosts a set of end machines and executes an MFE 450. In addition to executing an MFE, each of the gateway machines also executes a service router (e.g., an SR instance or application).

Although in the illustrated example two end machines that are connected to the same logical switch are hosted by two different host machines (e.g., VM1 and VM2, which are connected to the same logical switch, execute on the two different host machines Host1 and Host2), two or more end machines that are connected to a same logical switch may also operate on the same host machine. The virtual machines VM1 and VM3 communicate (e.g., exchange network data) with each other, with the virtual machines VM2 and VM4, and with the external network via the managed forwarding elements that implement the logical entities of the logical network 200 and the service routers.

In some embodiments, the MFEs 450 operating on the host machines are physical software switches provided by the hypervisors or other virtualization software of the host machines. These MFEs perform the entire first-hop forwarding processing for the logical switches 205 and 210 on packets that are received from the virtual machines VM1-VM4 of the logical network 200 (unless the pipeline of the transit logical switch 360 of the MFE specifies to send the packet to an SR). The MFEs residing on the host machines Host1 and Host2 may also implement logical switches (and distributed logical routers) for other logical networks if the other logical networks have VMs that reside on the host machines Host1 and Host2 as well.

Since each MFE 450 may perform first-hop processing, each MFE implements all of the logical forwarding elements, including the logical switches 205 and 210 and the DR 330, as well as the TLS 360. As described above, the MFEs implement the logical forwarding elements of the logical network to which the local end machines are logically connected. These MFEs may be flow-based forwarding elements (e.g., Open vSwitch) or code-based forwarding elements (e.g., ESX), or a combination of the two, in various different embodiments. These different types of forwarding elements implement the various logical forwarding elements differently, but in each case they execute a pipeline for each logical forwarding element that may be required to process a packet.

In some embodiments, when the MFE receives a packet from a VM that is coupled to the MFE, it performs the processing for the logical switch to which that VM logically couples, as well as the processing for any additional logical forwarding elements (e.g., logical router processing if the packet is sent to an external network, logical router processing and processing for the other logical switch in the network if the packet is sent to an end machine coupled to the other logical switch, etc.). The management and control planes distribute the logical forwarding data of the L2 logical switches 205 and 210, and the router 215, to the MFEs 450 in order for the MFEs to implement these logical forwarding elements. Additionally, the management and control planes distribute the logical forwarding data of the SRs to the gateway machines to connect the virtual machines VM1-VM4 to each other and to the external network.

The distributed router 330 and the TLS 360, as shown in the figure, are implemented across the MFEs 450 (e.g., in the same manner that the other logical forwarding elements are implemented). That is, the datapaths (e.g., in the MFEs 450, or in a different form factor on the gateway machines) all include the necessary processing pipelines for the DR 330 and the TLS 360. Unlike the DR, each of the two service routers 340 and 350 operates on a single gateway machine. In some embodiments, an SR may be implemented as a virtual machine or another type of container. The choice for the implementation of an SR, in some embodiments, may be based on the services chosen for the logical router and which type of SR best provides those types of services.

In some embodiments, the edge nodes 410 and 420 are host machines, which host service routers rather than user VMs. These edge nodes handle the north-south traffic of the logical network (e.g., connect the logical network 200 to the external network 405). As shown in the figure, each of the gateway machines includes an MFE as well (i.e., GMFE1 and GMFE2), which are similar to the other MFEs operating on the other host machines that implement the logical forwarding elements of the logical network 200. In the illustrated example, the service routers are shown as separate modules from the MFEs that operate on the gateway machines. Different embodiments, however, may implement the SRs differently. While some embodiments implement the SRs as VMs (e.g., when the MFE is a software switch integrated into the virtualization software of the gateway machine), some embodiments implement the SRs as virtual routing and forwarding (VRF) elements within the MFE datapath (e.g., when the MFE uses DPDK for the datapath processing).

In either case, the MFE treats the SR as part of the datapath, but in the case of the SR being a VM (or other data compute node) separate from the MFE, the MFE sends the packet to the SR for processing by the SR pipeline (which may include the performance of various services). As with the MFEs on the host machines Host1 and Host2, the GMFEs of the gateway machines, as described above, are configured to perform all of the distributed processing components of the logical network. The different MFEs and GMFEs that implement the logical forwarding elements use a tunnel protocol in order to exchange the network data between the different elements of the logical network 200. In some embodiments, the management plane (e.g., a master manager in the management cluster) distributes configuration data to the MFEs and GMFEs (e.g., through separate controllers, each of which is associated with a single MFE and/or GMFE), which includes forwarding data that defines how to set up tunnels between the MFEs.

For instance, the configuration data specifies the location (e.g., IP address) of each MFE as a tunnel endpoint (i.e., a software VTEP, or a hardware VTEP in the case of a TOR hardware switch implemented on a port of the MFE). The different MFEs receive the tunnel endpoint addresses of the other MFEs that implement the logical forwarding elements from the CCP cluster and store these addresses in the MFEs' corresponding VTEP tables. The MFEs then use these VTEP tables to establish tunnels between each other. That is, each source VTEP (e.g., the VTEP that sends the network data to a destination VTEP) uses its corresponding VTEP table data to encapsulate the packets received from a source VM. The source VTEP encapsulates the packets using a particular tunnel protocol (e.g., the VXLAN protocol) and forwards the packets towards the destination VTEP. The destination VTEP then decapsulates the packets using the same particular tunnel protocol and forwards the packets towards a destination VM.
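
To make the VTEP-table idea concrete, the following Python sketch models a source MFE looking up the destination VTEP for a logical switch and wrapping the inner frame in an outer header. The table layout, VNI values, and the dictionary-based "encapsulation" are simplifications assumed for this example, not the actual VXLAN wire format.

from typing import Dict, Tuple

# Hypothetical VTEP table: (logical switch VNI, destination MAC) -> remote VTEP IP.
VtepKey = Tuple[int, str]
vtep_table: Dict[VtepKey, str] = {
    (5001, "02:00:00:00:00:33"): "10.0.0.2",   # VM3 behind MFE2
    (5001, "02:00:00:00:00:44"): "10.0.0.3",   # VM4 behind another MFE
}

LOCAL_VTEP_IP = "10.0.0.1"  # this MFE's own tunnel endpoint address


def encapsulate(vni: int, dst_mac: str, inner_frame: bytes) -> dict:
    """Wrap an inner frame with an outer 'header' naming source and destination VTEPs."""
    remote_vtep = vtep_table[(vni, dst_mac)]
    return {
        "outer_src_vtep": LOCAL_VTEP_IP,
        "outer_dst_vtep": remote_vtep,
        "vni": vni,
        "payload": inner_frame,
    }


if __name__ == "__main__":
    pkt = encapsulate(5001, "02:00:00:00:00:33", b"...inner ethernet frame...")
    print(pkt["outer_dst_vtep"])  # -> 10.0.0.2, the VTEP that will decapsulate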

As an example, when VM1 sends a northbound packet to the external network 405, the datapath on MFE1 initially runs the source logical switch 205 pipeline (e.g., based on the ingress port through which the packet is received, the source MAC address, etc.). This pipeline specifies to forward the packet to the DR 330, the pipeline for which also takes place on the source MFE1. This pipeline identifies one of the SRs 340 and 350 as its next hop. In the active-standby case, the pipeline identifies the active SR; in the active-active case, some embodiments use ECMP to select one of the SRs. Next, the source MFE1 executes the pipeline for the transit logical switch 360, which specifies to tunnel the packet to the appropriate gateway machine (edge node) that hosts the selected SR (e.g., SR1 running on the gateway machine 410). The MFE1 then encapsulates the packet with the required data to send the packet to the GMFE1 (e.g., MFE1 adds its own IP address to the outer packet header as the source VTEP and the IP address of GMFE1 as the destination VTEP).

The gateway machine (e.g., the GMFE1 running on the gateway machine) receives the packet, decapsulates it (i.e., removes the tunneling data in the outer header of the packet), and identifies the SR1 based on the logical context information on the packet (e.g., the VNI of the transit logical switch 360) as well as the destination MAC address that corresponds to the southbound interface of SR1. The SR1 pipeline is then executed (e.g., by a VM implementing the SR1 in some embodiments, or by the GMFE1 in other embodiments). The SR1 pipeline ultimately sends the packet to the physical external network. In some embodiments, each SR's northbound interface is coupled to a physical router that receives the packet from the SR and distributes the packet towards its final destination in the external network.

As another example, when a packet that is destined for VM4 executing onthe host machine 440 is received at gateway machine 420 (at the SR 350),the SR pipeline identifies the DR 330 as its next hop. The GMFE2operating on the gateway machine then executes the transit logicalswitch 360 pipeline, which forwards the packet to the DR 330, as well asthe DR 330 pipeline, which routes the packet towards its destination.The destination logical switch pipeline (i.e., the logical switch 210)is also executed on the GMFE2, which specifies to tunnel the packet tothe MFE2 of the host machine 440 on which the destination virtualmachine VM4 resides. After decapsulating the packet, the destinationMFE2 delivers the packet to the virtual machine VM4.

FIG. 5 conceptually illustrates a process 500 of some embodiments thatconfigures a logical router on one or more edge nodes of a physicalinfrastructure that implements one or more logical networks. The process500, in some embodiments, is performed by a manager computer orapplication in the management cluster that manages one or more logicalnetworks for one or more tenants of a hosting system. The process startsby receiving (at 510) a logical router definition. The process receivesthe definition of the logical router along with other logical entitiesof the logical network from a user such as a tenant of the datacenter ora datacenter manager through a set of API calls. The process of someembodiments then generates, as part of the configuration data for thelogical router, a DR and a set of SRs based on the received definitionof the logical router.

The process then identifies (at 520) one or more edge nodes in an edge cluster for configuring the identified set of SRs (as well as the DR and other necessary logical network entities). The process of some embodiments, as will be described in more detail below by reference to FIG. 6, identifies the edge nodes based on a set of rules that is stored in a logical router policy of the logical network in a database. The process first identifies how many SRs are generated in the logical router's configuration data and then, for each SR, identifies the next best candidate in the set of edge nodes based on the logical router configuration policy.

Finally, the process configures (at 530) each SR on the identified edgenode. The process of some embodiments also configures the DR of thelogical router in the edge nodes as well as other host machines thatimplement the logical router. The IP and MAC addresses and otherconfiguration details assigned to the interfaces of the logical router(e.g., the four interfaces of the logical router 215 in FIG. 2) as partof the logical router definition are used to generate the configurationfor the various components (i.e., the distributed component and theservice routing components) of the logical router.

In addition, as part of the configuration, some embodiments generate arouting information base (RIB) for each of the logical routercomponents. That is, although the administrator defines only a singlelogical router, the management plane and/or control plane of someembodiments generates separate RIBs for the DR and for each of the SRs.For the SRs of a PLR, in some embodiments the management plane generatesthe RIB initially, but the physical implementation of the SR also runs adynamic routing protocol process (e.g., BGP, OSPF, etc.) to supplementthe RIB locally.

The specific operations of the process 500 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. For example, some embodiments do not identify all the edge nodes first and then configure the SRs on the identified edge nodes in the manner described for this figure. In some such embodiments, the process first selects the first SR of the logical router and identifies the appropriate edge node on which to install the SR. After configuring the SR on the identified edge node, these embodiments select the next SR of the logical router and identify the next appropriate edge node on which to configure the next SR. Additionally, one of ordinary skill in the art would realize that the process 500 could be implemented using several sub-processes, or as part of a larger macro process.

As described above, the management plane (e.g., a manager computer orapplication in the management cluster) of some embodiments identifiesthe edge nodes for configuring the service routers of a logical routerbased on a logical router configuration policy that is stored in adatabase. The configuration policy, in some embodiments, includes a setof rules that determines the selection of the edge nodes (e.g., gatewaymachines). In some embodiments, the set of rules is ordered based on aranking that is assigned to each rule. In some such embodiments, themanagement plane first tries to identify an edge node that matches thehighest ranked rule. That is, the management plane identifies thegateway machine that satisfies the specification that is set forth inthe rule. In some such embodiments, when the management plane does notfind any match for the highest ranked rule, the management plane triesto find an edge node that matches the next highest ranked rule in theset of rules. If none of the edge nodes satisfies any of the policyrules, the management plane of some embodiments selects a firstavailable edge node on which the logical router can be configured.
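
As an illustration only, the following sketch shows the ranked-rule selection described above, including the fall-back to the first available edge node when no rule matches. The rule representation (a list of predicates ordered by rank) and the node attributes are assumptions for the example, not a definitive implementation of the management plane.

from typing import Callable, Optional, Sequence

def select_edge_node(ranked_rules: Sequence[Callable[[dict], bool]],
                     edge_nodes: Sequence[dict]) -> Optional[dict]:
    # Try each rule in ranked order; the first available node matching the
    # highest ranked applicable rule wins.
    for rule in ranked_rules:
        for node in edge_nodes:
            if node.get("available", True) and rule(node):
                return node
    # No rule matched any node: fall back to the first available edge node.
    return next((n for n in edge_nodes if n.get("available", True)), None)

# Toy run: the higher ranked rule matches no node, so the second rule decides.
nodes = [{"name": "edge-1", "srs": 3}, {"name": "edge-2", "srs": 1}]
rules = [lambda n: n["srs"] == 0, lambda n: n["srs"] <= 1]
print(select_edge_node(rules, nodes)["name"])    # -> edge-2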

FIGS. 6A-6B illustrate a manager of a logical network that receives thelogical router configuration policy and logical network's definitionfrom one or more users and configures the logical routers of the logicalnetwork on one or more edge nodes based on the configuration policy.Configuration of other logical network elements such as logical switcheson the edge nodes and other physical nodes of the hosting system is notshown in these figures for the simplicity of description. Specifically,FIG. 6A shows a user (e.g., network administrator) 610, a managermachine 630, three gateway machines 660-670, and an external network680. The manager includes, among other modules and storages, aconfigurator module 640, a policy database 645, and a statisticsdatabase 650. The figure also shows a policy file 620 that is fed to themanager 630 by the user 610.

As illustrated in the figure, in some embodiments, after a user definesthe policy 620 that includes a set of rules, the manager 630 stores thepolicy in the policy database 645. The set of rules that is provided bythe user, in some embodiments, includes a rule for placing the logicalrouters on the edge nodes based on a bin packing algorithm. That is,some embodiments calibrate (1) each logical router's capacity (e.g.,predict the resource requirements for the logical router) and (2) eachedge node's capacity (e.g., into a range of bins). These embodimentsthen place the logical routers on the edge nodes such that an optimumnumber of logical routers fits into each edge node. Some embodimentscategorize the logical routers into different capacity groups (e.g.,compact, medium, large, etc.) after the calibration and place thelogical routers on the edge nodes (e.g., when the edge nodes are alsocategorized into a set of groups that correspond to different logicalrouter capacity groups).
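
By way of illustration only, a first-fit-decreasing pass is one way to realize the bin-packing style placement described above. The capacity units, the "compact/medium/large" sizing, and the node capacities below are illustrative assumptions, not values taken from the embodiments.

SIZE_UNITS = {"compact": 1, "medium": 2, "large": 4}

def place_routers(routers, nodes):
    """routers: list of (name, size); nodes: dict of node name -> free capacity units."""
    placement = {}
    # Place the largest routers first so they do not get stranded.
    for name, size in sorted(routers, key=lambda r: SIZE_UNITS[r[1]], reverse=True):
        need = SIZE_UNITS[size]
        for node, free in nodes.items():
            if free >= need:
                nodes[node] = free - need
                placement[name] = node
                break
        else:
            placement[name] = None   # no edge node has room for this router
    return placement

print(place_routers([("lr1", "large"), ("lr2", "compact"), ("lr3", "medium")],
                    {"edge-1": 4, "edge-2": 3}))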

In some embodiments, the set of rules may also include a rule thatselects the next edge node in a sequential order (i.e., in the roundrobin approach); a rule that selects an edge node that has been leastrecently used; an edge node that has lowest connections (to the externalnetworks); an edge node, through which, the lowest network trafficpasses (i.e., the lowest number of packets is sent, through the edgenode, to the external networks); an edge node, on which, the lowestnumber of logical routers (irrespective of their capacity) isconfigured; an edge node that has the lowest amount of aggregatedcapacity of configured logical routers; an edge node, on which otherlogical routers of the logical network (e.g., on a north-south path)have already been configured; and an edge node, on which, the leastnumber of logical routers of other logical networks is configured (e.g.,to ensure fault isolation between the tenants); or any other rules thatthe user defines.

As described above, each rule has a rank assigned to it in some embodiments. In some such embodiments, the configurator 640 retrieves the highest ranked rule from the database 645. The configurator then analyzes the gateway machines 660-670 to determine whether any of the gateway machines matches the retrieved rule. If a machine matches the rule, the configurator configures the logical router on that machine. If none of the gateway machines matches the rule, the configurator goes down the ranked rules in the database until the configurator finds the highest ranked rule that applies to one of the edge nodes 660-670. When the next highest ranked rule that is retrieved from the database 645 applies to two or more of the edge nodes 660-670 (in an edge cluster that has many more edge nodes), the configurator 640 selects the first edge node that is analyzed in some embodiments.

Some embodiments use the same data storage to store different databasesthat include different policies for different tenants of a hostingsystem. Some other embodiments use different data storages each of whichstores the policy data that is related to each tenant. Yet, in someother embodiments, each data storage of a set of data storages storesone or more databases that keep policy data for one or more tenants of adatacenter.

The management plane of some embodiments also deploys one or more databases (in one or more data storages) that keep the necessary statistical data, based on which the management plane decides which edge node is the best candidate for configuring the defined logical routers. That is, after the manager retrieves the highest ranked rule from the configuration policy database 645, in some embodiments, the manager looks into a statistics database 650 to determine which of the edge nodes match the retrieved rule. In other words, the configurator module 640 uses the statistics database 650 to determine the best candidate edge node based on the selection approach that is dictated by the rules received from the configuration policy database 645.

For instance, in a round robin approach, the management plane starts with the first available edge node and selects each next edge node based on the data in the statistics database, which keeps track of the edge nodes that have not been selected yet. As another example, the database, in some embodiments, keeps track of the number of connections that each edge node has established to the external networks. This way, when the policy dictates that the best candidate is the edge node that has the least number of connections, the management plane identifies the best candidate by querying the database for the edge node with the least number of connections.
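
Purely as an illustration, the sketch below shows these two statistics-driven selections. The layout of the statistics records (per-node connection and SR counts) is an assumption made for the example.

from itertools import cycle

stats = {
    "edge-1": {"connections": 12, "configured_srs": 3},
    "edge-2": {"connections": 4,  "configured_srs": 1},
    "edge-3": {"connections": 9,  "configured_srs": 2},
}

def least_connections(statistics):
    # Pick the edge node with the fewest connections to the external networks.
    return min(statistics, key=lambda n: statistics[n]["connections"])

def round_robin(node_names):
    # Yield edge nodes in a fixed sequential order, wrapping around.
    return cycle(node_names)

print(least_connections(stats))                  # -> edge-2
rr = round_robin(sorted(stats))
print(next(rr), next(rr), next(rr), next(rr))    # edge-1 edge-2 edge-3 edge-1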

In some embodiments, the management plane queries the edge nodes of theedge cluster in order to receive these statistical data and update thedata kept in the statistics database 650. In some other embodiments, themanagement plane updates this database on a transaction basis. Forexample, each time the management plane configures a logical router(i.e., a service router of the logical router) on an edge node orremoves a logical router from an edge node, the management plane updatesthe statistical data in the database. Yet, in some other embodiments,the management plane employs both of these methods. That is, themanagement plane updates the database with each new transaction, and atthe same time, the management plane queries the edge nodes uponoccurrence of an event (e.g., periodically) to receive more preciseinformation to store in the database. In some other embodiments, themanagement plane does not keep the statistical information in a separatedatabase (as shown in FIG. 8 below). In some such embodiments, themanager iteratively analyzes the available edge nodes and configures thelogical router on the first available edge node that matches theconfiguration rule (retrieved from the configuration policy).
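
The transaction-based bookkeeping described above can be pictured with the small sketch below: the statistics record is adjusted whenever an SR is configured on or removed from an edge node, and can also be refreshed by polling the nodes. The function names and record layout are hypothetical.

def record_sr_configured(statistics, node):
    statistics[node]["configured_srs"] += 1

def record_sr_removed(statistics, node):
    statistics[node]["configured_srs"] = max(0, statistics[node]["configured_srs"] - 1)

def refresh_from_nodes(statistics, poll_node):
    # poll_node(node) is assumed to return the authoritative counters for a node,
    # e.g., gathered periodically or upon some triggering event.
    for node in statistics:
        statistics[node].update(poll_node(node))

stats = {"edge-1": {"configured_srs": 2}}
record_sr_configured(stats, "edge-1")
print(stats["edge-1"]["configured_srs"])   # -> 3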

FIG. 6B shows a user (e.g., a datacenter's tenant) 615 defining a logical network by providing a definition of each logical network element to the manager 630. More specifically, the user 615 defines, through different API calls 690, the different elements (e.g., logical switches, logical routers, logical middleboxes, etc.) of the logical network for the manager 630. Each set of API calls instructs the manager to generate configuration data for one specific element of the logical network. As shown, the API calls are for generating two logical switches LS1 and LS2, and a logical router LR (although shown as one API call for each element, in reality, more than one call may be required for each element to be created).

The manager 630, in turn, generates configuration data for each of the logical network elements and configures these elements on different physical nodes of the physical network infrastructure. As described above, in some embodiments, the logical switches are configured on one or more host machines (i.e., on the MFEs of the host machines) as well as one or more gateway machines (i.e., on the MFEs of the gateway machines) that connect the physical network infrastructure, on which the logical network is implemented, to external networks.

The manager 630 also configures the logical router on the host machines and gateway machines. That is, the manager first generates configuration data for a DR of the logical router and a set of SRs of the logical router based on the logical router's definition received from the user. The manager then configures the DR on the MFEs of both the host machines and the gateway machines. The manager (e.g., the configurator 640 of the manager) also configures the set of SRs on the gateway machines in the manner that is described above and below. Although the users 610 and 615 illustrated in FIGS. 6A and 6B are shown as two separate users, in some embodiments both of these users can be the same (e.g., a tenant of a datacenter, a network administrator of the datacenter, etc.).

FIG. 7 conceptually illustrates a process 700 of some embodiments that automatically selects an edge node of an edge cluster based on a configuration policy, and configures a logical router (i.e., a service routing component of the logical router) on the selected edge node. In some embodiments the process is performed by a manager machine (or application) in a management cluster of a hosting system. The process starts by receiving (at 710) the configuration policy for configuring the logical routers. In some embodiments, the configuration policy for logical routers is part of a larger policy that specifies other configuration rules for other logical network elements. In some other embodiments, the configuration policy is specifically for configuring logical routers of the logical network.

As described above, the configuration policy includes a set of ranked rules, each of which specifies a condition that an edge node must satisfy in order for a service router of the logical router to be configured on the edge node. One example is a rule that specifies selection of an edge node on which a logical router that belongs to a different tier of the logical network is configured. Another example rule specifies that the service router be configured on an edge node that currently passes the least amount of network traffic. A third example is a rule that specifies selection of an edge node on which the least number of other service routers are configured. For instance, when multiple PLRs are configured on multiple edge nodes and a new PLR should be configured in the network, the configuring manager configures the SR(s) of the new PLR on the edge node(s) that implement the least number of SRs of the PLRs that have previously been configured.

The process 700 then receives (at 720) a definition of a logical router.As described above, the process receives the definition of the logicalrouter among other definitions of other logical forwarding elements froma user in order to configure a logical network topology on the physicalnodes of a physical network. Although not shown in this figure, theprocess of some embodiments generates different routing components ofthe logical router based on the received definition. The process thendetermines (at 730) whether any node in the edge cluster matches thespecification specified in the highest ranked rule of the policy. Such adetermination, as was described above, can be made by looking up therelated data for each edge node that is kept in one or more statisticsdatabases. Some other embodiments make such a determination, as will bedescribed in more detail below by reference to FIG. 8, by analyzing theedge nodes one by one and configuring the logical router on the firstmatched edge node. Some other embodiments utilize a combination of bothmethods.

When the process determines that an edge node in the edge cluster matches the specification of the highest ranked rule of the policy, the process configures (at 740) the logical router (e.g., an SR of the logical router) on the edge node. The process then ends. On the other hand, when none of the edge nodes matches the specification of the rule, the process determines (at 750) whether any rules are left in the configuration policy that have not been processed yet. When the process determines that there is a next highest ranked rule left in the policy that has not been processed, the process selects (at 760) the next highest ranked rule and returns to operation 730 to determine whether any of the edge nodes matches this newly selected rule. On the other hand, when the process determines that no more rules are left in the policy, the process of some embodiments configures (at 770) the logical router on the first available edge node. In some embodiments, the process returns an error message when the process does not find any available gateway machine on which to configure the logical router.

The specific operations of the process 700 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. For example, some embodiments, after receiving the logical router's definition (at 720) and generating the logical router's different routing components, select the first SR of the logical router. These embodiments then perform the rest of the operations 730-760. After that, however, instead of configuring the LR (at 770) on the identified node, these embodiments configure the first SR on the identified node and then determine whether the LR has any more SRs.

If there are more SRs, these embodiments select the next SR and returnto operation 730 at which the process determines whether the newlyselected SR can be configured on any of the nodes and so on. The processin these embodiments ends when there are no more SRs left to examine.Additionally, one of ordinary skill in the art would realize that theprocess 700 could be implemented using several sub-processes, or as partof a larger macro process. For example, the operations 730 and 740 areperformed by the process illustrated in FIG. 8 and described below insome embodiments.

FIG. 8 conceptually illustrates a process 800 of some embodiments thatdetermines whether an edge node matches a selected rule, and configuresa logical router on the edge node when the node satisfies the specifiedrequirement in the rule. The manager of some embodiments performs thisprocess when some or all of the statistical data related to the edgecluster is not stored at and/or accessible by the manager. The processstarts (at 810) by identifying the first edge node of the edge cluster.In some embodiments, only a particular set of edge nodes of the edgecluster is assigned to each tenant, or set of tenants. In some otherembodiments, every edge node of the edge cluster can be used by one ormore tenants of the hosting system. That is, an SR of a particular PLRthat connects a logical network of a tenant to an external network (wheneach logical network is connected to the external networks through aparticular PLR) can be configured on any of the edge nodes of the edgecluster. In some such embodiments, however, if the particular PLR hasmore than one SR, each SR has to be configured on a separate edge node.
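
The per-SR placement constraint noted above (each SR of a PLR on a separate edge node) can be sketched as follows. This is a minimal, self-contained illustration; the helper name and the rule callback are assumptions, not part of the described process.

def place_srs_for_plr(sr_names, edge_nodes, matches_rule):
    """Place each SR of a PLR on a distinct edge node that matches the rule."""
    placement, used = {}, set()
    for sr in sr_names:
        node = next((n for n in edge_nodes
                     if n not in used and matches_rule(n)), None)
        if node is None:
            raise RuntimeError(f"no remaining edge node matches the rule for {sr}")
        placement[sr] = node
        used.add(node)
    return placement

# Usage: two SRs of one PLR never share an edge node.
print(place_srs_for_plr(["SR1", "SR2"], ["edge-1", "edge-2", "edge-3"],
                        matches_rule=lambda n: True))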

The process then determines (at 820) whether the selected node (whetherselected from a specific set of nodes or from all the nodes of the edgecluster) matches the rule of the configuration policy that is beingprocessed. When the process determines that the selected node satisfiesthe requirement that is specified in the rule, the process configures(at 830) the logical router (i.e., the SR of the logical router) on theselected node. The process then ends. On the other hand, when theprocess determines that the selected node does not satisfy the specifiedrequirement by the rule, the process determines (at 840) whether thereis any other edge node left in the edge cluster (or in a particular setof edge nodes that is assigned to the logical network).

When the process determines that there is an edge node left (in the edge cluster or a particular set of edge nodes), the process selects (at 850) the next edge node and returns to operation 820 to determine whether the next selected node matches the rule. On the other hand, when the process determines that no more edge nodes are left in the edge cluster, the process ends. In some embodiments, when the process does not find any edge node for the specified rule after processing all of the edge nodes, the process returns an error message, informing the user that the logical router cannot be configured for the logical network.

The specific operations of the process 800 may not be performed in theexact order shown and described. The specific operations may not beperformed in one continuous series of operations, and different specificoperations may be performed in different embodiments. Additionally, oneof ordinary skill in the art would realize that the process 800 could beimplemented using several sub-processes, or as part of a larger macroprocess.

The configuration policy, as described above, includes a set of rankedrules that each specifies a condition for an edge node that must besatisfied in order to configure a service router of the logical routeron the edge node. An example of the rules includes a rule that requiresadjacency of the logical routers that belong to different tiers of thesame logical network. When two logical routers are implemented on thesame gateway machine, a packet does not have to be tunneled to adifferent gateway machine when one of the logical routers sends thepacket to the other logical router. As such, in order to improve thenetwork efficiency, the user may define a rule that requires adjacencyof implementation of the logical routers, which are situated ondifferent layers of the logical network, on a single gateway machine. Inorder to better understand this concept, several examples of amulti-tier logical network topology and implementing different logicalrouters of a logical network on the same gateway machine are describedbelow by reference to FIGS. 9-13.
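
For illustration only, the adjacency rule described above can be expressed as a predicate that prefers an edge node already hosting an SR of another tier of the same logical network, so that inter-tier traffic stays on one gateway machine. The hosted_routers attribute below is an assumed per-node record, not a field defined in the embodiments.

def adjacency_rule(logical_network_id):
    def rule(node):
        return any(r["network"] == logical_network_id
                   for r in node.get("hosted_routers", []))
    return rule

edge_nodes = [
    {"name": "edge-1", "hosted_routers": [{"network": "ln-1", "router": "PLR-SR1"}]},
    {"name": "edge-2", "hosted_routers": []},
]
# An SR of a TLR in logical network "ln-1" would match edge-1 under this rule.
print([n["name"] for n in edge_nodes if adjacency_rule("ln-1")(n)])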

The previous examples illustrate only a single tier of logical router.For logical networks with multiple tiers of logical routers, someembodiments may include both DRs and SRs at each level, or DRs and SRsat the upper level (the PLR tier) with only DRs at the lower level (theTLR tier). FIG. 9 conceptually illustrates a multi-tier logical network900 of some embodiments, with FIGS. 10 and 11 illustrating two differentmanagement plane views of the logical networks.

FIG. 9 conceptually illustrates a logical network 900 with two tiers oflogical routers. As shown, the logical network 900 includes, at thelayer 3 level, a provider logical router 905 and several tenant logicalrouters 910-920. The first tenant logical router 910 has two logicalswitches 925 and 930 attached, with one or more data compute nodescoupling to each of the logical switches (not shown in the figure). Forsimplicity of description, only the logical switches attached to thefirst TLR 910 are shown, although the other TLRs 915-920 would typicallyhave logical switches attached to them as well (e.g., logical switchesto which data compute nodes couple).

In some embodiments, any number of TLRs may be attached to a PLR such asthe PLR 905. Some datacenters may have only a single PLR to which allTLRs implemented in the datacenter attach, whereas other datacenters mayhave numerous PLRs. For instance, a large datacenter may want to usedifferent PLR policies for different tenants, or may have too manydifferent tenants to attach all of the TLRs to a single PLR. Part of therouting table for a PLR includes routes for all of the logical switchdomains of its TLRs, so attaching numerous TLRs to a PLR creates severalroutes for each TLR just based on the subnets attached to the TLR. ThePLR 905, as shown in the figure, provides a connection to the externalphysical network 935; some embodiments only allow the PLR to providesuch a connection, so that the datacenter provider can manage thisconnection. Each of the separate TLRs 910-920, though part of thelogical network 900, are configured independently (although a singletenant could have multiple TLRs if the tenant so chose).

FIGS. 10 and 11 illustrate different possible management plane views of the logical network 900, depending on whether or not the TLR 910 includes a service routing component. In these examples, the routing aspects of the TLR 910 are always distributed using a DR. However, if the configuration of the TLR 910 includes the provision of stateful services, then the management plane view of the TLR (and thus the physical implementation) will include active and standby SRs for these stateful services.

FIG. 10 illustrates the management plane view 1000 for the logical topology of FIG. 9 when the TLR 910 is completely distributed. For simplicity, only details of the first TLR 910 are shown; the other TLRs will each have their own single DRs, as well as a set of SRs in some cases. As shown in the figure, the PLR 905 includes a DR 1005 and three SRs 1010-1020, connected together by a transit logical switch 1025. In addition to the transit logical switch 1025 within the PLR 905 implementation, the management plane also defines separate transit logical switches 1030-1040 between each of the TLRs and the DR 1005 of the PLR. In the case in which the TLR 910 is completely distributed (FIG. 10), the transit logical switch 1030 connects to a DR 1045 that implements the configuration of the TLR 910 (i.e., TLR 910 will not have a separate SR). Thus, when a packet is sent to a destination in the external network by a data compute node attached to the logical switch 925, the packet will be processed through the pipelines of the logical switch 925, the DR 1045 of TLR 910, the transit logical switch 1030, the DR 1005 of the PLR 905, the transit logical switch 1025, and one of the SRs 1010-1020. In some embodiments, the existence and definition of the transit logical switches 1025 and 1030-1040 are hidden from the user that configures the network through the API (e.g., a network administrator), with the possible exception of troubleshooting purposes.

In some embodiments, when the second-tier logical router (e.g., TLR 910in the illustrated example) does not provide stateful services, themanagement plane does not generate a service routing component for thelogical router and therefore, does not configure a service router on anyof the edge nodes. In other words, in some embodiments, the managementplane only configures the SR(s) of the PLR on the edge node(s) of theedge cluster. The management plane of some embodiments also configuresthe SRs of any other second-tier logical router that provides statefulservices, on the edge cluster. As will be described in more detailbelow, in some such embodiments a user can define a configuration policythat specifies any second-tier logical router that provides statefulservices must be configured on an edge node adjacent to the first-tierlogical router (e.g., the SR of TLR 920 should be implemented on an edgenode that implements one of the SRs of the PLR 905).

FIG. 11 illustrates the management plane view 1100 for the logical topology 900 when the TLR 910 has at least one service routing component (e.g., because stateful services that cannot be distributed are defined for the TLR). In some embodiments, stateful services such as firewalls, NAT, load balancing, etc. are only provided in a centralized manner. Other embodiments allow for some or all of such services to be distributed, however. As with the previous figure, only details of the first TLR 910 are shown for simplicity; the other TLRs may have the same defined components (DR, transit LS, and a set of SRs) or have only a DR, as in the example of FIG. 10. The PLR 905 is implemented in the same manner as in the previous figure, with the DR 1005 and the three SRs 1010-1020, connected to each other by the transit logical switch 1025. In addition, as in the previous example, the management plane places the transit logical switches 1030-1040 between the PLR and each of the TLRs.

The partially centralized implementation of the TLR 910 includes a DR1105 to which the logical switches 925 and 930 (LS A and LS B) attach,as well as two SRs 1110 and 1115. As in the PLR implementation, the DRand the two SRs each have interfaces to a transit logical switch 1120.This transit logical switch serves the same purposes as the transitlogical switch 1025, in some embodiments. For TLRs, some embodimentsimplement the SRs in active-standby manner, with one of the SRsdesignated as active and the other designated as standby. Thus, so longas the active SR is operational, packets sent by a data compute nodeattached to one of the logical switches 925 and 930 will be sent to theactive SR rather than the standby SR.

The above figures illustrate the management plane view of logicalrouters of some embodiments. In some embodiments, an administrator orother user provides the logical topology (as well as other configurationinformation) through a set of APIs. This data is provided to amanagement plane, which defines the implementation of the logicalnetwork topology (e.g., by defining the DRs, SRs, transit logicalswitches, etc.). In addition, in some embodiments, the management planeassociates each logical router (e.g., each PLR or TLR) with a set ofphysical machines (e.g., a pre-defined group of machines in an edgecluster) for deployment.

For purely distributed routers, such as the TLR 905 as implemented inFIG. 10, the set of physical machines is not important, as the DR isimplemented across the managed forwarding elements that reside on hostsalong with the data compute nodes that connect to the logical network.However, if the logical router implementation includes SRs, then theseSRs will each be deployed on specific physical machines. In someembodiments, the group of physical machines is a set of machinesdesignated for the purpose of hosting SRs (as opposed to user VMs orother data compute nodes that attach to logical switches). In otherembodiments, the SRs are deployed on machines alongside the user datacompute nodes.

In some embodiments, the user definition of a logical router includes aparticular number of uplinks. An uplink, in some embodiments, is anorthbound interface of a logical router in the logical topology. For aTLR, its uplinks connect to a PLR (all of the uplinks connect to thesame PLR, generally). For a PLR, its uplinks connect to externalrouters. Some embodiments require all of the uplinks of a PLR to havethe same external router connectivity, while other embodiments allow theuplinks to connect to different sets of external routers. Once a groupof machines (i.e., the edge cluster) for the logical router isspecified, if SRs are required for the logical router, the managementplane automatically assigns each of the uplinks of the logical router toan edge machine in the edge cluster based on the configuration policy.The management plane then creates an SR on each of the edge machines towhich an uplink is assigned. Some embodiments allow multiple uplinks tobe assigned to the same edge machine, in which case the SR on themachine has multiple northbound interfaces.
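
The uplink assignment described above can be pictured with the short sketch below: each uplink of the logical router is mapped to an edge machine chosen by the configuration policy, and one SR is created per edge machine that receives at least one uplink, so a machine that receives two uplinks gets a single SR with two northbound interfaces. The callback and data shapes are assumptions for the example.

from collections import defaultdict

def assign_uplinks(uplinks, pick_edge_machine):
    """Map each uplink to an edge machine; one SR per machine that gets uplinks."""
    srs = defaultdict(list)
    for uplink in uplinks:
        machine = pick_edge_machine(uplink)       # policy-driven choice
        srs[machine].append(uplink)
    return dict(srs)

# With this toy policy both uplinks land on the same machine, so the resulting
# SR would have two northbound interfaces.
print(assign_uplinks(["uplink-1", "uplink-2"], lambda _u: "edge-1"))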

As mentioned above, in some embodiments the SR may be implemented as avirtual machine or other container, or as a VRF context (e.g., in thecase of DPDK-based SR implementations). In some embodiments, the choicefor the implementation of an SR may be based on the services chosen forthe logical router and which type of SR best provides those services.

In addition, the management plane of some embodiments creates thetransit logical switches. For each transit logical switch, themanagement plane assigns a unique virtual network identifier (VNI) tothe logical switch, creates a port on each SR and DR that connects tothe transit logical switch, and allocates an IP address for any SRs andthe DR that connect to the logical switch. Some embodiments require thatthe subnet assigned to each transit logical switch is unique within alogical L3 network topology having numerous TLRs (e.g., the networktopology 900), each of which may have its own transit logical switch.That is, in FIG. 11, transit logical switch 1025 within the PLRimplementation, transit logical switches 1030-1040 between the PLR andthe TLRs, and transit logical switch 1120 (as well as the transitlogical switch within the implementation of any of the other TLRs) eachrequires a unique subnet (i.e., a subdivision of the IP network).
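
Purely as an illustration of the bookkeeping described above, the sketch below allocates a unique VNI and a unique subnet for each transit logical switch. The VNI range and the 169.254.0.0/16 carving are assumed example values, not values mandated by the embodiments.

import ipaddress
from itertools import count

class TransitSwitchAllocator:
    def __init__(self, vni_start=5000, supernet="169.254.0.0/16", prefixlen=28):
        self._vnis = count(vni_start)
        self._subnets = ipaddress.ip_network(supernet).subnets(new_prefix=prefixlen)
        self._allocated = {}

    def allocate(self, name):
        if name in self._allocated:              # each transit switch allocated once
            return self._allocated[name]
        entry = {"vni": next(self._vnis), "subnet": next(self._subnets)}
        self._allocated[name] = entry
        return entry

alloc = TransitSwitchAllocator()
print(alloc.allocate("plr-transit"))
print(alloc.allocate("tlr1-transit"))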

Some embodiments place various restrictions on the connection of logicalrouters in a multi-tier configuration. For instance, while someembodiments allow any number of tiers of logical routers (e.g., a PLRtier that connects to the external network, along with numerous tiers ofTLRs), other embodiments only allow a two-tier topology (one tier ofTLRs that connect to the PLR). In addition, some embodiments allow eachTLR to connect to only one PLR, and each logical switch created by auser (i.e., not a transit logical switch) is only allowed to connect toone PLR or one TLR. Some embodiments also add the restriction thatsouthbound ports of a logical router must each be in different subnets.Thus, two logical switches may not have the same subnet if connecting tothe same logical router. Lastly, some embodiments require that differentuplinks of a PLR must be present on different gateway machines. Itshould be understood that some embodiments include none of theserequirements, or may include various different combinations of therequirements.
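
Two of the restrictions listed above can be written as simple validation checks, sketched below for illustration: southbound ports of a logical router must be in different subnets, and different uplinks of a PLR must be placed on different gateway machines. The data shapes are assumptions for the example.

import ipaddress

def validate_southbound_subnets(southbound_ports):
    subnets = [ipaddress.ip_interface(p["cidr"]).network for p in southbound_ports]
    if len(set(subnets)) != len(subnets):
        raise ValueError("two southbound ports share a subnet")

def validate_uplink_placement(uplink_to_gateway):
    gateways = list(uplink_to_gateway.values())
    if len(set(gateways)) != len(gateways):
        raise ValueError("two uplinks of the PLR are on the same gateway machine")

validate_southbound_subnets([{"cidr": "10.0.1.1/24"}, {"cidr": "10.0.2.1/24"}])
validate_uplink_placement({"uplink-1": "edge-1", "uplink-2": "edge-2"})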

FIG. 12 illustrates a logical network topology 1200 that is connected toan external network through a set of logical routers. More specifically,this figure shows that a set of data compute nodes (e.g., virtualmachines) of a logical network (e.g., a logical network that belongs toa tenant of the datacenter) can exchange data with an external networkthrough multiple logical routers of the logical network that are placedin the same or different layers (tiers) of the logical topology. Thefigure includes an external network (e.g., Internet) 1205, two logicalrouters 1210 and 1220 that are directly connected to the externalnetwork (i.e., first-tier logical routers), a second-tier logical router1230, logical switches 1240-1280, and virtual machines VM1-VM8.

As illustrated in the figure, the first-tier logical router 1210 (e.g.,a PLR of the logical network) has an interface that is connected to theexternal network 1205 (e.g., through a physical hardware router), aninterface for the logical router 1230, and an interface for the logicalswitch 1240. Logical switch 1240 logically connects the two virtualmachines VM3 and VM4 to the other elements of the logical network aswell as other physical and/or logical network elements (through theexternal network 1205). The logical router 1230 (e.g., a TLR of thelogical network) has a northbound interface for the logical router 1210,a southbound interface for the logical switch 1250, and anothersouthbound interface for the logical switch 1260. These logical switcheslogically connect VM1 and VM2, respectively, to the other networkelements (local and external). The logical router 1220 (e.g., a PLR) hasa northbound interface that is connected to the external network 1205(e.g., through the same or different hardware router to which thelogical router 1210 is connected) and two southbound interfaces for thelogical switches 1270 and 1280 that logically connect VM5-VM8 to thelogical network.

Each of the virtual machines communicates with the other virtualmachines in the logical network and other network elements in theexternal network through a particular subset of logical switches androuters among the illustrated logical switches and routers of thefigure. For example, when virtual machine VM1 wants to send a packet tothe virtual machine VM6, the packet is sent to and processed by amanaged forwarding element that (1) runs in the same host machine as VM1and (2) implements a logical port of the logical switch 1250, to whichVM1 is logically connected. The MFE processes the packet by performingforwarding processing for the logical switch 1250, DR of logical router1230, DR of logical router 1220, and logical switch 1270.

After such forwarding processing, the MFE identifies the MFE on whichthe destination tunnel endpoint is implemented (could be the same MFE oranother MFE that runs on a different host machine). The identified MFEimplements the logical port of the logical switch 1270, to which VM6 isconnected. As such, if VM6 executes on the same host machine as VM1, thefirst MFE performs the forwarding processing for the logical port oflogical switch 1270, to which VM6 is connected and sends the packet toVM6. On the other hand, if VM6 runs on a different host machine, thefirst MFE tunnels the packet to the identified MFE which implements thelogical port of logical switch 1270, to which VM6 is connected. In thiscase, the identified MFE, in turn, performs the forwarding processingfor the logical port and sends the packet to the virtual machine VM6.

As another example when VM4 sends a packet to a remote virtual orphysical machine (through the external network 1205), an MFE thatexecutes in the same host machine (e.g., in the hypervisor of the hostmachine) as VM4 receives the packet and performs the forwardingprocessing on the packet for the logical switch 1240 and the logicalrouter 1210 (e.g., the DR of the logical router 1210). The DR (and TLS)of the logical router then realizes that the SR (e.g., active SR) of thelogical router is implemented by a gateway machine that is connected tothe external network. The packet is then tunneled to the gatewaymachine, which in turn performs additional forwarding processing (by anMFE that implements the destination tunnel endpoint that receives thepacket) and sends the packet to the SR of the logical router 1210. TheSR then sends the packet to a physical router that routes the packettowards its destination through the external network.

The logical routers 1210 and 1220 are different PLRs that connect the network elements of the same logical network to the external network in some embodiments. The logical routers 1210 and 1220 can also be PLRs for different logical networks that are implemented by the physical network infrastructure in some other embodiments. The logical router 1230, on the other hand, is a tenant router (TLR) that connects to the external network only through the PLR 1210. As such, the network would be more efficient if the SRs of the logical routers 1210 and 1230, which are different-tier routers of the same logical network, were implemented on the same edge node.

That is, when a packet is sent from the SR of the logical router 1230 tothe DR of the logical router 1210 (through a TLS), the network is moreefficient if the packet is processed on the same edge node (i.e.,gateway machine) instead of being sent from an edge node that implementsthe first SR to another edge node that implements the second SR.Therefore, a user can define a configuration policy that requires theSRs of the different tier logical routers be configured next to eachother (i.e., be configured on the same gateway machine).

FIGS. 13A-13B illustrate the physical implementation of the logicalnetwork elements shown in FIG. 12, on different edge nodes of an edgecluster. The figure does not show the implementations of the logicalentities on other host machines for simplicity of the description. FIG.13A includes the external network 1205 and three gateway machines1310-1330. Each gateway machine executes an MFE (e.g., in a hypervisorof the gateway machine) that implements the logical forwarding elements(i.e., logical switches and routers) of the logical network. That is,each MFE implements the logical switches LS1-LS3, the distributedcomponents DR1-DR3 of the logical routers, and the transit logicalswitches TLS1-TLS2. Each MFE also implements the TLS between the activeSR of the TLR (i.e., SR2) and the DR of the first PLR (i.e., DR1), whichis not shown in the figure.

As illustrated in FIG. 13A, the gateway machine 1310 also executes SR 1340, which is the service routing component of the first PLR (LR1), while the gateway machine 1320 executes SR 1350, which is the service routing component of the second PLR (LR3). Each of these service routers connects the logical network to the external network. Each of these SRs is configured on the edge nodes based on a set of policy rules that are defined by a user and stored in the policy configuration database. One of the rules of the configuration policy defined by the user is that when there is a multi-tier logical topology, the system must configure the SRs (e.g., active SRs) of the logical routers on the same edge node (as long as the same edge node has the capacity for such a configuration).

Therefore, when the user defines a second-tier logical router LR2(through one or more APIs), the management plane of some embodimentsgenerates the different components of the logical router and looks atthe configuration policy to determine on which edge node the SR of LR2should be configured. After analyzing the configuration policy (storedin the configuration database), the management plane realizes that thehighest ranked related rule specifies that a second-tier logical routerhas to be configured adjacent to the SR of the first-tier logical router(e.g., one of the SRs of the logical router LR1, which connects to theexternal network).

FIG. 13B illustrates the same elements that are shown in FIG. 13A withthe exception that in this figure, gateway machine 1310 is alsoexecuting the SR 1360, which is the SR of the second-tier logical routerLR2. As shown in the figure, the management plane has determined thatthe gateway machine 1310 is the ideal edge node for implementing SR2since this machine is also implementing SR1. The management plane hasmade such a determination based on the information that is stored in theconfiguration database as well as the information that is stored in thestatistics database.

That is, the configuration database (e.g., stored in a data storage on amanager computer) specifies that the SRs have to be adjacent and thestatistics database (e.g., stored in the same or different data storageon the manager computer) identifies the gateway machine 1310 as the edgenode that implements the service router of the first PLR (i.e., LR1). Asdescribed above, the manager computer of some embodiments does not storea statistics database that identifies the edge nodes. In some suchembodiments, the manager analyzes each edge node to identify the properedge node on which to configure the logical router.

In the illustrated example, the configuration policy might have two ormore different rules for configuring the logical routers on the edgenodes, each of which may apply to the current logical routerconfiguration, and each of which may point to a different edge node. Forinstance, a first rule of the policy may specify that logical routersshould be configured adjacent to each other (when they belong todifferent tiers of the same logical network). At the same time, a secondrule in the configuration policy might specify that each new logicalrouter should be configured on an edge node that implements the leastnumber of logical routers (i.e., an edge node that executes the leastnumber of SRs).

In such a circumstance, the management plane of some embodiments decides on which edge node to configure the logical router based on the ranking of the rules in the configuration database. For example, in the illustrated figure, if the adjacency rule has a higher ranking, the management plane configures SR2 on the gateway machine 1310 (as shown in the figure). On the other hand, if the rule that selects the edge node with the fewest configured SRs has a higher priority, the management plane configures SR2 on the gateway machine 1330, since this machine, as shown, currently does not execute any SR of any logical router of the logical network.

Some embodiments load balance the logical routers that are configured onthe different edge nodes of an edge cluster. That is, in someembodiments, the management plane, based on occurrence of a certainevent (e.g., user request, lapse of certain time period, node failure,configuration policy update, etc.) identifies the logical routers thatare implemented by each edge node, and based on the configurationpolicy, reassigns the different logical routers to different edge nodes.For instance, when an edge node fails or shuts down, the managementplane of some embodiments automatically reassigns the logical routers(among other logical entities) between the edge nodes that are stillactive in the edge cluster.

FIG. 14 conceptually illustrates a process 1400 of some embodiments thatload balances the implementations of different logical routers of one ormore logical networks among the different edge nodes of an edge cluster.Although in the illustrated figure, the process is activated in certaintime intervals, as described above, the triggering event for the processto be activated can be different events in different embodiments (e.g.,addition or removal of a certain number of logical routers, failure ofan edge node, etc.). The process 1400 is performed by the managementplane (e.g., a manager machine of a central management cluster of adatacenter) in some embodiments.

The process 1400 starts by resetting (at 1410) a timer. The process then determines (at 1420) whether a certain time period has elapsed. If the time period has not elapsed, the process loops back and waits until the time period has elapsed. When the process determines that the certain time period (e.g., a predefined time period) has elapsed, the process receives (at 1420) the statistical data about the number and capacity of the logical routers that are configured on each edge node. In some embodiments the process receives this information from a database such as the statistics database 650 shown in FIG. 6, which stores all the statistical data of every edge node of the edge cluster. In some other embodiments, the process selects each edge node one by one and retrieves the statistical data about the logical routers that are configured on the edge nodes directly.

The process then reconfigures (at 1430) the logical routers on theactive edge nodes. The process of some embodiments reconfigures thelogical routers based on the configuration policy that is defined forthe process. That is, the process starts with the first logical router(e.g., the first SR of the logical router) and based on the rulesdefined in the policy configures the logical router on an edge node. Theprocess then iteratively configures each next logical router based onthe defined policy and the statistical data of the edge nodes until allof the logical routers are configured on the edge nodes.
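
The rebalancing pass described above can be sketched as follows: refresh the per-node statistics, then re-place every logical router across the currently active edge nodes according to the policy. The callbacks and record layout are assumptions made for this illustration.

def rebalance(logical_routers, active_nodes, poll_node, pick_node):
    """Refresh per-node statistics, then re-place every logical router."""
    statistics = {node: poll_node(node) for node in active_nodes}
    placement = {}
    for lr in logical_routers:
        node = pick_node(statistics)              # policy-driven selection
        placement[lr] = node
        statistics[node]["configured_srs"] += 1   # keep stats consistent per placement
    return placement

# Toy run: always pick the node with the fewest configured SRs.
print(rebalance(["LR1", "LR2", "LR3"], ["edge-1", "edge-2"],
                poll_node=lambda n: {"configured_srs": 0},
                pick_node=lambda s: min(s, key=lambda n: s[n]["configured_srs"])))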

The specific operations of the process 1400 may not be performed in theexact order shown and described. For example, instead of waiting for acertain period of time to lapse in order to start reconfiguring thelogical routers, some embodiments reconfigure the logical routers when auser updates the configuration policy. For example, when a user adds anew rule to the configuration policy, some embodiments activate theabove-described process. Some other embodiments activate the processonly when a new configuration policy (with a new set of rules) replacesthe previously defined configuration policy. As described above, othertriggering events for reconfiguring the logical routers could bedifferent in different embodiments. For example, some embodimentsactivate the process each time a defined number of logical routers isadded to the logical network. Some other embodiments activate theprocess each time an edge node fails (or is turned off).

The specific operations may not be performed in one continuous series ofoperations, and different specific operations may be performed indifferent embodiments. Additionally, one of ordinary skill in the artwould realize that the process 1400 could be implemented using severalsub-processes, or as part of a larger macro process.

Many of the above-described features and applications are implemented assoftware processes that are specified as a set of instructions recordedon a computer readable storage medium (also referred to as computerreadable medium). When these instructions are executed by one or morecomputational or processing unit(s) (e.g., one or more processors, coresof processors, or other processing units), they cause the processingunit(s) to perform the actions indicated in the instructions. Examplesof computer readable media include, but are not limited to, CD-ROMs,flash drives, random access memory (RAM) chips, hard drives, erasableprogrammable read-only memories (EPROMs), electrically erasableprogrammable read-only memories (EEPROMs), etc. The computer readablemedia does not include carrier waves and electronic signals passingwirelessly or over wired connections.

In this specification, the term “software” is meant to include firmwareresiding in read-only memory or applications stored in magnetic storage,which can be read into memory for processing by a processor. Also, insome embodiments, multiple software inventions can be implemented assub-parts of a larger program while remaining distinct softwareinventions. In some embodiments, multiple software inventions can alsobe implemented as separate programs. Finally, any combination ofseparate programs that together implement a software invention describedhere is within the scope of the invention. In some embodiments, thesoftware programs, when installed to operate on one or more electronicsystems, define one or more specific machine implementations thatexecute and perform the operations of the software programs.

FIG. 15 conceptually illustrates an electronic system 1500 with whichsome embodiments of the invention are implemented. The electronic system1500 may be a computer (e.g., a desktop computer, personal computer,tablet computer, etc.), server, dedicated switch, phone, PDA, or anyother sort of electronic or computing device. Such an electronic systemincludes various types of computer readable media and interfaces forvarious other types of computer readable media. Electronic system 1500includes a bus 1505, processing unit(s) 1510, a system memory 1525, aread-only memory 1530, a permanent storage device 1535, input devices1540, and output devices 1545.

The bus 1505 collectively represents all system, peripheral, and chipsetbuses that communicatively connect the numerous internal devices of theelectronic system 1500. For instance, the bus 1505 communicativelyconnects the processing unit(s) 1510 with the read-only memory 1530, thesystem memory 1525, and the permanent storage device 1535.

From these various memory units, the processing unit(s) 1510 retrievesinstructions to execute and data to process in order to execute theprocesses of the invention. The processing unit(s) may be a singleprocessor or a multi-core processor in different embodiments.

The read-only-memory (ROM) 1530 stores static data and instructions thatare needed by the processing unit(s) 1510 and other modules of theelectronic system. The permanent storage device 1535, on the other hand,is a read-and-write memory device. This device is a non-volatile memoryunit that stores instructions and data even when the electronic system1500 is off. Some embodiments of the invention use a mass-storage device(such as a magnetic or optical disk and its corresponding disk drive) asthe permanent storage device 1535.

Other embodiments use a removable storage device (such as a floppy disk, flash memory device, etc., and its corresponding drive) as the permanent storage device. Like the permanent storage device 1535, the system memory 1525 is a read-and-write memory device. However, unlike storage device 1535, the system memory 1525 is a volatile read-and-write memory, such as random access memory. The system memory 1525 stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 1525, the permanent storage device 1535, and/or the read-only memory 1530. From these various memory units, the processing unit(s) 1510 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 1505 also connects to the input and output devices 1540 and1545. The input devices 1540 enable the user to communicate informationand select commands to the electronic system. The input devices 1540include alphanumeric keyboards and pointing devices (also called “cursorcontrol devices”), cameras (e.g., webcams), microphones or similardevices for receiving voice commands, etc. The output devices 1545display images generated by the electronic system or otherwise outputdata. The output devices 1545 include printers and display devices, suchas cathode ray tubes (CRT) or liquid crystal displays (LCD), as well asspeakers or similar audio output devices. Some embodiments includedevices such as a touchscreen that function as both input and outputdevices.

Finally, as shown in FIG. 15, bus 1505 also couples electronic system 1500 to a network 1565 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 1500 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors,storage and memory that store computer program instructions in amachine-readable or computer-readable medium (alternatively referred toas computer-readable storage media, machine-readable media, ormachine-readable storage media). Some examples of such computer-readablemedia include RAM, ROM, read-only compact discs (CD-ROM), recordablecompact discs (CD-R), rewritable compact discs (CD-RW), read-onlydigital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a varietyof recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.),flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.),magnetic and/or solid state hard drives, read-only and recordableBlu-Ray® discs, ultra density optical discs, any other optical ormagnetic media, and floppy disks. The computer-readable media may storea computer program that is executable by at least one processing unitand includes sets of instructions for performing various operations.Examples of computer programs or computer code include machine code,such as is produced by a compiler, and files including higher-level codethat are executed by a computer, an electronic component, or amicroprocessor using an interpreter.

While the above discussion primarily refers to microprocessor ormulti-core processors that execute software, some embodiments areperformed by one or more integrated circuits, such as applicationspecific integrated circuits (ASICs) or field programmable gate arrays(FPGAs). In some embodiments, such integrated circuits executeinstructions that are stored on the circuit itself. In addition, someembodiments execute software stored in programmable logic devices(PLDs), ROM, or RAM devices.

As used in this specification and any claims of this application, theterms “computer”, “server”, “processor”, and “memory” all refer toelectronic or other technological devices. These terms exclude people orgroups of people. For the purposes of the specification, the termsdisplay or displaying means displaying on an electronic device. As usedin this specification and any claims of this application, the terms“computer readable medium,” “computer readable media,” and “machinereadable medium” are entirely restricted to tangible, physical objectsthat store information in a form that is readable by a computer. Theseterms exclude any wireless signals, wired download signals, and anyother ephemeral signals.

This specification refers throughout to computational and network environments that include virtual machines (VMs). However, virtual machines are merely one example of data compute nodes (DCNs) or data compute end nodes, also referred to as addressable nodes. DCNs may include non-virtualized physical hosts, virtual machines, containers that run on top of a host operating system without the need for a hypervisor or separate operating system, and hypervisor kernel network interface modules.

VMs, in some embodiments, operate with their own guest operating systems on a host using resources of the host virtualized by virtualization software (e.g., a hypervisor, virtual machine monitor, etc.). The tenant (i.e., the owner of the VM) can choose which applications to operate on top of the guest operating system. Some containers, on the other hand, are constructs that run on top of a host operating system without the need for a hypervisor or separate guest operating system. In some embodiments, the host operating system uses name spaces to isolate the containers from each other and therefore provides operating-system level segregation of the different groups of applications that operate within different containers. This segregation is akin to the VM segregation that is offered in hypervisor-virtualized environments that virtualize system hardware, and thus can be viewed as a form of virtualization that isolates different groups of applications that operate in different containers. Such containers are more lightweight than VMs.
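By way of a non-limiting illustration of the name-space-based segregation just described, the following Python sketch runs a command inside fresh Linux name spaces. The util-linux unshare tool, its flags, and the need for elevated privileges are assumptions about a particular Linux host rather than features of the embodiments.

    import subprocess

    def run_isolated(command):
        """Run `command` in new network, PID, and mount name spaces so that it
        cannot see the host's interfaces, processes, or mounts."""
        return subprocess.run(
            ["unshare", "--net", "--pid", "--mount", "--fork"] + list(command),
            check=True,
        )

    if __name__ == "__main__":
        # Inside the new network name space only a loopback device exists, so
        # this listing differs from the interface list seen on the host itself.
        run_isolated(["ip", "link", "show"])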

A hypervisor kernel network interface module, in some embodiments, is a non-VM DCN that includes a network stack with a hypervisor kernel network interface and receive/transmit threads. One example of a hypervisor kernel network interface module is the vmknic module that is part of the ESXi™ hypervisor of VMware, Inc.

It should be understood that while the specification refers to VMs, the examples given could be any type of DCNs, including physical hosts, VMs, non-VM containers, and hypervisor kernel network interface modules. In fact, the example networks could include combinations of different types of DCNs in some embodiments.

Additionally, the term “packet” is used throughout this application to refer to a collection of bits in a particular format sent across a network. It should be understood that the term “packet” may be used herein to refer to various formatted collections of bits that may be sent across a network. A few examples of such formatted collections of bits are Ethernet frames, TCP segments, UDP datagrams, IP packets, etc.
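As a purely illustrative example of such a formatted collection of bits, the following Python sketch decodes the fixed 14-byte Ethernet frame header with the standard struct module; the sample frame bytes are hypothetical.

    import struct

    def parse_ethernet_header(frame: bytes):
        """Return (destination MAC, source MAC, EtherType) from a raw Ethernet frame."""
        dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])

        def mac(raw: bytes) -> str:
            return ":".join(f"{octet:02x}" for octet in raw)

        return mac(dst), mac(src), hex(ethertype)

    if __name__ == "__main__":
        # A broadcast frame whose EtherType 0x0800 indicates an IPv4 payload.
        sample = bytes.fromhex("ffffffffffff" "00163e112233" "0800") + b"payload"
        print(parse_ethernet_header(sample))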

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIGS. 5, 7, 8, 14) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

We claim:
 1. A method for configuring edge gateways for logical networks in a multi-tenant datacenter hosting a plurality of tenant logical networks, the method comprising: receiving a definition of a particular logical router for a particular tenant, the definition specifying stateful services to perform on data messages between a logical network of the particular tenant and one or more external networks; from a shared cluster of a plurality of edge gateways of the datacenter that connect the datacenter to one or more external networks, selecting an edge gateway to serve as the edge gateway for connecting the particular tenant logical network to the one or more external networks and for performing the specified stateful services on data messages between the particular tenant logical network and the one or more external networks; configuring the selected edge gateway to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks as a centralized routing component of the particular logical router; and configuring the selected edge gateway and at least one forwarding element connected to a set of data compute nodes of the particular tenant logical network to implement a distributed component of the particular logical router.
 2. The method of claim 1, wherein configuring the selected edge gateway to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks comprises configuring the selected edge gateway to implement a service routing component of the particular logical router.
 3. The method of claim 1, wherein configuring the selected edge gateway to implement the distributed routing component of the particular logical router comprises configuring a forwarding element executing on the selected edge gateway to implement the distributed routing component.
 4. The method of claim 1, wherein the edge gateway is an active, first edge gateway, the method further comprising: selecting a second edge gateway to serve as a standby edge gateway for connecting the particular tenant logical network to the one or more external networks and for performing the specified stateful services on data messages between the particular tenant logical network and the one or more external networks; and configuring the standby, second edge gateway as a standby centralized routing component of the particular logical router to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks when the first edge gateway is unavailable.
 5. The method of claim 1, wherein the edge gateway is a first edge gateway, the particular tenant is a first tenant, the particular logical router is a first logical router, the specified stateful services are a first set of stateful services, and the particular tenant logical network is a first tenant logical network, the method further comprising: receiving a definition of a second logical router for a second tenant, the definition of the second logical router specifying a second set of stateful services to perform on data messages between a second logical network of the second tenant and one or more external networks; selecting the first edge gateway to serve as a standby edge gateway for connecting the second tenant logical network of the second tenant to the one or more external networks and for performing the second set of stateful services on data messages between the second tenant logical network and the one or more external networks; and configuring the first edge gateway as a standby centralized routing component of the second logical router to perform the second set of stateful services on data messages between the second tenant logical network and the one or more external networks when a second edge gateway configured to perform the second set of stateful services on data messages between the second tenant logical network and the one or more external networks is unavailable.
 6. The method of claim 1, wherein selecting the edge gateway comprises selecting the edge gateway according to an edge gateway selection rule of a gateway selection rule set, the edge gateway selection rule comprising a set of edge gateway selection criteria.
 7. The method of claim 6, wherein the gateway selection rule set comprises a plurality of rules and the edge gateway selection rule is a first rule for selecting an active edge gateway, the method further comprising selecting a standby, second edge gateway according to a second, different rule of the gateway selection rule set for selecting a standby edge gateway.
 8. The method of claim 6, wherein the gateway selection rule set is assigned to the particular tenant and the set of edge gateway selection criteria comprises at least one of a physical constraint, a product constraint, and a tenant defined constraint.
 9. The method of claim 8, wherein the set of edge gateway selection criteria specifies that a logical router of the particular tenant may not coexist on the same edge gateway with a logical router of another tenant.
 10. The method of claim 6, wherein the set of edge gateway selection criteria comprises a set of load balancing criteria, wherein selecting the edge gateway comprises selecting the edge gateway according to the edge gateway selection rule and real time statistical data.
 11. The method of claim 6, wherein selecting the edge gateway further comprises (i) before selecting the edge gateway, identifying, based on the gateway selection rules, a set of edge gateways in the plurality of edge gateways that are not qualified to perform the specified stateful services on data messages associated with the particular tenant logical network and (ii) selecting an edge gateway from the plurality of edge gateways excluding the set of edge gateways identified as not qualified.
 12. The method of claim 1, wherein the data messages between the particular tenant logical network and the one or more external networks comprise (i) data messages sent by data compute nodes of the particular tenant logical network and (ii) data messages addressed to data compute nodes of the particular tenant logical network.
 13. A non-transitory machine readable medium storing a program for execution by at least one processing unit and for configuring edge gateways for logical networks in a multi-tenant datacenter hosting a plurality of tenant logical networks, the program comprising sets of instructions for: receiving a definition of a particular logical router for a particular tenant, the definition specifying stateful services to perform on data messages between a logical network of the particular tenant and one or more external networks; selecting, from a shared cluster of a plurality of edge gateways of the datacenter that connect the datacenter to one or more external networks, an edge gateway to serve as the edge gateway for connecting the particular tenant logical network to the one or more external networks and for performing the specified stateful services on data messages between the particular tenant logical network and the one or more external networks; configuring the selected edge gateway to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks as a centralized routing component of the particular logical router; and configuring the selected edge gateway and at least one forwarding element connected to a set of data compute nodes of the particular tenant logical network to implement a distributed component of the particular logical router.
 14. The non-transitory machine readable medium of claim 13, wherein the set of instructions for configuring the selected edge gateway to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks comprises a set of instructions for configuring the selected edge gateway to implement a service routing component of the particular logical router.
 15. The non-transitory machine readable medium of claim 13, wherein the set of instructions for configuring the selected edge gateway to implement the distributed routing component of the particular logical router comprises a set of instructions for configuring a forwarding element executing on the selected edge gateway to implement the distributed routing component.
 16. The non-transitory machine readable medium of claim 13, wherein the edge gateway is an active, first edge gateway, the program further comprising sets of instructions for: selecting a second edge gateway to serve as a standby edge gateway for connecting the particular tenant logical network to the one or more external networks and for performing the specified stateful services on data messages between the particular tenant logical network and the one or more external networks; and configuring the standby, second edge gateway as a standby centralized routing component of the particular logical router to perform the specified stateful services on data messages between the particular tenant logical network and the one or more external networks when the first edge gateway is unavailable.
 17. The non-transitory machine readable medium of claim 13, wherein the edge gateway is a first edge gateway, the particular tenant is a first tenant, the particular logical router is a first logical router, the specified stateful services are a first set of stateful services, and the particular tenant logical network is a first tenant logical network, the program further comprising sets of instructions for: receiving a definition of a second logical router for a second tenant, the definition of the second logical router specifying a second set of stateful services to perform on data messages between a second logical network of the second tenant and one or more external networks; selecting the first edge gateway to serve as a standby edge gateway for connecting the second tenant logical network of the second tenant to the one or more external networks and for performing the second set of stateful services on data messages between the second tenant logical network and the one or more external networks; and configuring the first edge gateway as a standby centralized routing component of the second logical router to perform the second set of stateful services on data messages between the second tenant logical network and the one or more external networks when a second edge gateway configured to perform the second set of stateful services on data messages between the second tenant logical network and the one or more external networks is unavailable.
 18. The non-transitory machine readable medium of claim 13, wherein the set of instructions for selecting the edge gateway comprises a set of instructions for selecting the edge gateway according to an edge gateway selection rule of a gateway selection rule set, the edge gateway selection rule comprising a set of edge gateway selection criteria.
 19. The non-transitory machine readable medium of claim 13, wherein the data messages between the particular tenant logical network and the one or more external networks comprise (i) data messages sent by data compute nodes of the particular tenant logical network and (ii) data messages addressed to data compute nodes of the particular tenant logical network.
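The following Python sketch is offered solely to illustrate the rule-based gateway selection recited in claims 6, 10, and 11 and does not limit any claim; the class names, the load statistic, and the example criterion are hypothetical.

    from dataclasses import dataclass, field
    from typing import Callable, List, Set

    @dataclass
    class EdgeGateway:
        name: str
        current_load: float              # real-time utilization statistic (0.0 to 1.0)
        tenants: Set[str] = field(default_factory=set)

    @dataclass
    class SelectionRule:
        # Each criterion returns True when the gateway is qualified for the tenant.
        criteria: List[Callable[[EdgeGateway, str], bool]]

    def select_edge_gateway(cluster: List[EdgeGateway], tenant: str,
                            rule: SelectionRule) -> EdgeGateway:
        # (i) Exclude gateways that are not qualified under the selection rule.
        qualified = [gw for gw in cluster
                     if all(criterion(gw, tenant) for criterion in rule.criteria)]
        if not qualified:
            raise RuntimeError("no edge gateway satisfies the selection rule")
        # (ii) Load balance across the remaining gateways using real-time statistics.
        chosen = min(qualified, key=lambda gw: gw.current_load)
        chosen.tenants.add(tenant)
        return chosen

    # Example criterion: the tenant's logical router may not share an edge
    # gateway with another tenant's logical router (cf. claim 9).
    exclusive_rule = SelectionRule(
        criteria=[lambda gw, tenant: not (gw.tenants - {tenant})])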